An Introduction To Matrix Decomposition and Graphical Model


Lei Zhang, Lead Researcher, Microsoft Research Asia

2012-04-17

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA

• Two Applications
  – Document decomposition for "long query" retrieval
  – Modeling Threaded Discussions

What Is Matrix Decomposition

• We wish to decompose the matrix A by writing it as a product of two or more matrices:

A_{n×m} = B_{n×k} C_{k×m}

• Suppose A, B, C are column matrices:
  – A_{n×m} = (a1, a2, …, am), each ai is an n-dim data sample
  – B_{n×k} = (b1, b2, …, bk), each bj is an n-dim basis, and space B consists of k bases
  – C_{k×m} = (c1, c2, …, cm), each ci is the k-dim coordinates of ai projected to space B

Why We Need Matrix Decomposition

• Given one data sample: a1 = B_{n×k} c1, i.e.,

(a11, a12, …, a1n)^T = (b1, b2, …, bk)(c11, c12, …, c1k)^T

• Another data sample: a2 = B_{n×k} c2

• More data samples: am = B_{n×k} cm

• Together (m data samples): (a1, a2, …, am) = B_{n×k} (c1, c2, …, cm), i.e.,

A_{n×m} = B_{n×k} C_{k×m}

Why We Need Matrix Decomposition

(a1, a2, …, am) = B_{n×k} (c1, c2, …, cm), i.e., A_{n×m} = B_{n×k} C_{k×m}

• We wish to find a set of new bases B to represent the data samples A, and A becomes C in the new space.

• In general, B captures the common features in A, while C carries the specific characteristics of the original samples.

• In PCA, B is the eigenvectors
• In SVD, B is the right (column) eigenvectors
• In LDA, B is the discriminant directions
• In NMF, B is the local features

PRINCIPAL COMPONENT ANALYSIS

Definition – Eigenvalue & Eigenvector

Given an m × m matrix C, for any λ and w ≠ 0, if

Cw = λw

then λ is called an eigenvalue and w an eigenvector of C.

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principal Component Analysis

• C can be decomposed as C = UΛU^T
• Λ is a diagonal matrix diag(λ1, λ2, …, λm); each λi is an eigenvalue
• U is an orthogonal matrix, each column an eigenvector: U^T U = I, U^{-1} = U^T

Maximizing Variance

• The objective of the rotation transformation is to find the direction of maximal variance
• The projection of the data along w is Aw
• Variance: σ²_w = (Aw)^T(Aw) = w^T A^T A w = w^T C w, where C = A^T A is the covariance matrix of the data (A is centered)
• Task: maximize the variance subject to the constraint w^T w = 1

Optimization Problem

• Maximize J(w) = w^T C w − λ(w^T w − 1), where λ is the Lagrange multiplier
• Differentiating with respect to w yields Cw − λw = 0
• Eigenvalue equation: Cw = λw, where C = A^T A
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found

Property: Data Decomposition

• PCA can be treated as data decomposition:

a = UU^T a
  = (u1, u2, …, un)(u1, u2, …, un)^T a
  = (u1, u2, …, un)(⟨u1, a⟩, ⟨u2, a⟩, …, ⟨un, a⟩)^T
  = (u1, u2, …, un)(b1, b2, …, bn)^T
  = Σ_i b_i u_i
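A minimal numerical sketch of the two views above, variance maximization and data decomposition, using numpy; the data matrix, its dimensions, and the random seed are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))          # n = 100 samples (rows), m = 5 variables
A = A - A.mean(axis=0)                 # center: each column now has zero mean

C = A.T @ A                            # covariance matrix (up to a 1/n factor)
lam, U = np.linalg.eigh(C)             # eigenvalues ascending, columns of U orthonormal
order = np.argsort(lam)[::-1]          # sort by decreasing eigenvalue (variance)
lam, U = lam[order], U[:, order]

B = A @ U                              # coordinates of the samples in the eigenbasis

# Decomposition view: any centered sample a satisfies a = U U^T a = sum_i <u_i, a> u_i
a = A[0]
assert np.allclose(a, U @ (U.T @ a))
```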

Face Recognition – Eigenface

• Turk, M.A., Pentland, A.P. Face recognition using eigenfaces. CVPR 1991. (Citations: 2654)
• The eigenface approach:
  – images are points in a vector space
  – use PCA to reduce dimensionality
  – face space
  – compare projections onto face space to recognize faces

PageRank – Power Iteration

• Column j has nonzero elements in the positions corresponding to the outlinks of j (N_j in total)
• Row i has nonzero elements in the positions corresponding to the inlinks I_i

Column-Stochastic & Irreducible

• Column-stochastic: every column of the link matrix Q sums to one,
• where e^T Q = e^T, e = (1, 1, …, 1)^T
• Irreducible: achieved by adding a random-jump term, P = αQ + (1 − α)(1/n) ee^T

Iterative PageRank Calculation

• For k = 1, 2, …: r_k = P r_{k−1}, normalized so that ||r_k||_1 = 1
• Equivalently, solve Pr = λr with λ = 1 (P is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?

Convergence of the power iteration

• Expand the initial approximation r_0 in terms of the eigenvectors: r_0 = c1 u1 + c2 u2 + … + cn un; then P^k r_0 = c1 λ1^k u1 + c2 λ2^k u2 + …, which converges to the direction of u1 when |λ1| > |λ2| ≥ …
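A small sketch of the iteration on a toy graph; the link matrix, damping value, and tolerance are illustrative assumptions:

```python
import numpy as np

# Toy web graph: L[i, j] = 1 if page j links to page i (column j = outlinks of j)
L = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
Q = L / L.sum(axis=0)                        # column-stochastic link matrix

alpha, n = 0.85, Q.shape[0]
P = alpha * Q + (1 - alpha) / n              # random-jump step keeps P irreducible

r = np.full(n, 1.0 / n)
for _ in range(100):                         # power iteration: r_k = P r_{k-1}
    r_new = P @ r
    if np.linalg.norm(r_new - r, 1) < 1e-12:
        break
    r = r_new
print(r / r.sum())                           # PageRank vector (first eigenvector, lambda = 1)
```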

SINGULAR VALUE DECOMPOSITION

SVD – Definition

• Any m × n matrix A with m ≥ n can be factorized as

A = UΣV^T

where U (m × n) has orthonormal columns, Σ = diag(σ1, …, σn), and V (n × n) is orthogonal.

Singular Values and Singular Vectors

• The diagonal elements σj of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD: Av_j = σ_j u_j

Matrix Approximation

• Theorem: Let U_k = (u1, u2, …, uk), V_k = (v1, v2, …, vk) and Σ_k = diag(σ1, σ2, …, σk), and define A_k = U_k Σ_k V_k^T

• Then min_{rank(B) ≤ k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}

• It means that the best approximation of rank k for the matrix A is A_k
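A short numpy sketch of the rank-k approximation A_k = U_k Σ_k V_k^T; the random matrix and the choice k = 2 are for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 6))                     # m = 8 >= n = 6
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # best rank-k approximation

# The theorem: the 2-norm error of the best rank-k approximation is sigma_{k+1}
print(np.linalg.norm(A - A_k, 2), s[k])         # the two numbers agree
```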

SVD and PCA

• We can write A^T A = VΣ²V^T
• Remember that in PCA we treat A as a row matrix
• V is just the eigenvectors of A^T A:
  – each column in V is an eigenvector of the row matrix A
  – we use V to approximate a row in A
• Equivalently, we can write AA^T = UΣ²U^T
• U is just the eigenvectors of AA^T:
  – each column in U is an eigenvector of the column matrix A
  – we use U to approximate a column in A

Example – LSI

• Build a term-by-document matrix A
• Compute the SVD of A: A = UΣV^T
• Approximate A by A_k = U_k D_k, where D_k = Σ_k V_k^T:
  – U_k: orthogonal basis that we use to approximate all the documents
  – D_k: column j holds the coordinates of document j in the new basis
  – D_k is the projection of A onto the subspace spanned by U_k

SVD and PCA

• For symmetric A, SVD is closely related to PCA

• PCA: A = UΛU^T
  – U and Λ are the eigenvectors and eigenvalues
• SVD: A = UΛV^T
  – U is the left (column) eigenvectors
  – V is the right (row) eigenvectors
  – Λ is the same eigenvalues

• For symmetric A, the column eigenvectors equal the row eigenvectors

• Note the difference of A in PCA and SVD:
  – SVD: A is directly the data, e.g., a term-by-document matrix
  – PCA: A is a covariance matrix, A = X^T X, where each row in X is a sample

Latent Semantic Indexing (LSI)

1. Document file preparation/preprocessing:
   – Indexing: collecting terms
   – Use a stop list: eliminate "meaningless" words
   – Stemming
2. Construction of the term-by-document matrix; sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback

Latent Semantic Indexing

• Assumption: there is some underlying latent semantic structure in the data

• E.g., "car" and "automobile" occur in similar documents, as do "cows" and "sheep"

• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD

Similarity Measures

• Term to term: AA^T = UΣ²U^T = (UΣ)(UΣ)^T
  UΣ are the coordinates of A (rows) projected to space V

• Document to document: A^T A = VΣ²V^T = (VΣ)(VΣ)^T
  VΣ are the coordinates of A (columns) projected to space U

• Term to document: A = UΣV^T = (UΣ^{1/2})(VΣ^{1/2})^T
  UΣ^{1/2} are the coordinates of A (rows) projected to space V
  VΣ^{1/2} are the coordinates of A (columns) projected to space U

HITS (Hyperlink-Induced Topic Search)

• Idea: the Web includes two flavors of prominent pages:
  – authorities contain high-quality information
  – hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities

[Figure: hubs pointing to authorities]

Power Iteration

• Each page i has both a hub score h_i and an authority score a_i
• HITS successively refines these scores by computing a_i = Σ_{j→i} h_j and h_i = Σ_{i→j} a_j
• Define the adjacency matrix L of the directed web graph: L_{ij} = 1 if page i links to page j, else 0
• Now: a = L^T h, h = La

HITS and SVD

• In L, rows are outlinks and columns are inlinks

• a will be the dominant eigenvector of the authority matrix L^T L
• h will be the dominant eigenvector of the hub matrix LL^T

• h and a are in fact the first left and right singular vectors of L
• We are in fact running SVD on the adjacency matrix
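A sketch of the HITS iteration on a toy adjacency matrix; the graph and iteration count are illustrative assumptions:

```python
import numpy as np

# L[i, j] = 1 if page i links to page j (rows: outlinks, columns: inlinks)
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

a = np.ones(4)                       # authority scores
h = np.ones(4)                       # hub scores
for _ in range(50):
    a = L.T @ h                      # good authorities are pointed to by good hubs
    h = L @ a                        # good hubs point to good authorities
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

# a and h converge to the dominant eigenvectors of L^T L and L L^T,
# i.e., the first right and left singular vectors of L
print(a, h)
```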

HITS vs. PageRank

• PageRank may be computed once; HITS is computed per query

• HITS takes the query into account; PageRank doesn't

• PageRank has no concept of hubs

• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot

• PageRank is more stable, because of its random-jump step

NMF – NON-NEGATIVE MATRIX FACTORIZATION

Definition

• Given a nonnegative matrix V_{n×m}, find non-negative matrix factors W_{n×k} and H_{k×m} such that

V_{n×m} ≈ W_{n×k} H_{k×m}

• V: column matrix, each column is a data sample (n-dimensional)
• W: the k columns are basis vectors; each column represents one base
• H: coordinates of V projected to W:

v_j ≈ W_{n×k} h_j

Motivation

• Non-negativity is natural in many applications

• Probability is also non-negative

• An additive model captures local structure

Multiplicative Update Algorithm

• Cost function: Euclidean distance, ||V − WH||²

• Multiplicative update (element-wise):
  H ← H ⊙ (W^T V) / (W^T WH)
  W ← W ⊙ (VH^T) / (WHH^T)

Multiplicative Update Algorithm

• Cost function: divergence, D(A||B) = Σ_{ij} (A_{ij} log(A_{ij}/B_{ij}) − A_{ij} + B_{ij})
  – reduces to the Kullback-Leibler divergence when Σ_{ij} A_{ij} = Σ_{ij} B_{ij} = 1
  – A and B can then be regarded as normalized probability distributions

• Multiplicative update (see Lee & Seung, NIPS 2001, for the update rules)

• pLSA is NMF with KL divergence
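A minimal sketch of the Lee–Seung multiplicative updates for the Euclidean cost ||V − WH||²; the random data, k = 5, the iteration count, and the small epsilon guard are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.random((20, 30))                 # nonnegative data, columns are samples
n, m, k = V.shape[0], V.shape[1], 5

W = rng.random((n, k))
H = rng.random((k, m))
eps = 1e-10                              # guard against division by zero

for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps) # multiplicative update for H
    W *= (V @ H.T) / (W @ H @ H.T + eps) # multiplicative update for W

print(np.linalg.norm(V - W @ H))         # reconstruction error never increases
```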

NMF vs PCA

• n = 2429 faces, m = 19×19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels

• NMF: parts-based representation
• PCA: holistic representation

Reference

• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999).

Major Reference

• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki. (Highly recommended)

Outline

• Basic concepts
  – Likelihood, i.i.d.
  – ML, MAP and Bayesian inference
  – Expectation-Maximization
  – Mixture of Gaussians, parameter estimation

• pLSA
  – Motivation
  – Derivation & geometry properties
  – Applications

• LDA
  – Motivation: why add a hyperparameter
  – Dirichlet distribution
  – Variational EM
  – Relations with other topic models
  – Incorporating category information

• Summary

Not Included

• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Data
• Let x = (x1, x2, …, xD)^T denote a data point, and D = {x(1), x(2), …, x(N)} a data set. D is sometimes associated with desired outputs y1, y2, ….

Predictions
• We are generally interested in predicting something based on the observed data set.
• Given D, what can we say about x(N+1)?

Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ.
• Given data D, we learn the model parameters, from which we can predict new data points.
• The model can often be expressed as a probability distribution over data points.

Likelihood Function

• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data.

• Inversely, given the observed data and a model of interest, the likelihood function is defined as

L(θ) = f_θ(x|θ) = p(x|θ)

• That is, the likelihood function L(θ) shows that some parameter values are more likely to have produced the data.

Maximum Likelihood (ML)

• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model.

• Suppose we are given n data samples (x1, x2, …, xn).

• Maximum likelihood finds the θ that maximizes L(θ): θ_ML = argmax_θ p(x1, …, xn | θ)

• Predictive distribution: p(x | θ_ML)

IID – Independent, Identically Distributed

• IID means the samples are drawn independently from the same distribution.

• The problem is considerably simplified, as L(θ) = p(x1, …, xn | θ) = Π_i p(x_i | θ)

• Usually the log likelihood is used: ℓ(θ) = log L(θ) = Σ_i log p(x_i | θ)
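As a concrete example of these formulas, a sketch of the i.i.d. log likelihood for a univariate Gaussian and its closed-form ML solution; the synthetic data and true parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # i.i.d. samples

def log_likelihood(mu, sigma):
    # log L(theta) = sum_i log p(x_i | theta), thanks to the i.i.d. assumption
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# For the Gaussian, the ML estimates have a closed form:
mu_ml = x.mean()
sigma_ml = x.std()                              # ML variance uses 1/n, not 1/(n-1)
print(mu_ml, sigma_ml, log_likelihood(mu_ml, sigma_ml))
```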

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Introduction to Machine Learning, Lectures 1-2 slides)

• Gregor Heinrich. Parameter estimation for text analysis. Technical note, 2005-2008.

EXPECTATION MAXIMIZATION

Why We Need EM

• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models.

• Why do we need latent variables?

• To describe complex models: e.g., the Gaussian Mixture Model

• To discover the intrinsic structure inside a data set: e.g., topic models such as pLSA and LDA

More General

• Data set: D
• Likelihood: p(D|θ) = Σ_X p(D, X|θ), where X are the latent variables

• Goal: learn maximum likelihood (ML) parameter values

• The maximum likelihood procedure finds parameters θ such that θ* = argmax_θ p(D|θ)

• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated and hard-to-optimize function.

The Expectation Maximization (EM) Algorithm

• The EM algorithm finds a (local) maximum of a latent-variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:

• E step: fill in the values of the latent variables according to the posterior given the data.

• M step: maximize the likelihood as if the latent variables were not hidden.

• It decomposes difficult problems into a series of tractable steps.

Jensen's Inequality

For a concave function f (such as log): f(E[x]) ≥ E[f(x)].

Lower Bounding the Log Likelihood
• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ.
• Goal: maximize the log likelihood (i.e., ML learning) w.r.t. θ: L(θ) = log p(D|θ).

• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood, using Jensen's inequality:

L(θ) = log Σ_X q(X) p(D, X|θ)/q(X) ≥ Σ_X q(X) log [p(D, X|θ)/q(X)] = F(q, θ)

• F(q, θ) = ⟨log p(D, X|θ)⟩_{q(X)} + H[q], where H[q] is the entropy of q(X).

The E and M Steps of EM

• The lower bound on the log likelihood is given by F(q, θ) = ⟨log p(D, X|θ)⟩_{q(X)} + H[q]

• EM alternates between:
• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed:

q^{(k)}(X) = argmax_{q(X)} F(q(X), θ^{(k−1)})

• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed:

θ^{(k)} = argmax_θ F(q^{(k)}(X), θ)

The E Step

• E step: for fixed θ,

F(q, θ) = L(θ) − KL[q(X) || p(X|D, θ)]

• The second term is the Kullback-Leibler divergence.
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when KL[q(X) || p(X|D, θ)] = 0.
• So the E step simply sets q^{(k)}(X) = p(X|D, θ^{(k−1)}).

The M Step

• M step: maximize F(q, θ) w.r.t. the parameters, holding the hidden distribution q fixed:

θ^{(k)} = argmax_θ F(q^{(k)}(X), θ) = argmax_θ ⟨log p(D, X|θ)⟩_{q^{(k)}(X)}

• The second equality comes from the fact that the entropy of q(X) does not depend on θ.

• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically.

EM Never Decreases the Likelihood

• The E and M steps together never decrease the log likelihood:

L(θ^{(k−1)}) = F(q^{(k)}, θ^{(k−1)}) ≤ F(q^{(k)}, θ^{(k)}) ≤ L(θ^{(k)})

• The E step brings F(q, θ) up to the likelihood L(θ).
• The M step maximizes F(q, θ) w.r.t. θ.
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL.
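A compact sketch of EM for the two-component 1-D Gaussian mixture mentioned earlier; the synthetic data, initialization, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def gauss(x, mu, sd):
    # univariate normal density
    return np.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

pi = np.array([0.5, 0.5])            # mixing proportions
mu = np.array([-1.0, 1.0])           # component means (arbitrary init)
sd = np.array([1.0, 1.0])            # component standard deviations

for _ in range(100):
    # E step: responsibilities q(z_n = k) = p(z_n = k | x_n, theta)
    r = pi * gauss(x[:, None], mu, sd)
    r /= r.sum(axis=1, keepdims=True)
    # M step: maximize the expected complete-data log likelihood
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(pi.round(2), mu.round(2), sd.round(2))   # roughly (0.3, 0.7), (-2, 3), (1, 1)
```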

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Unsupervised learning, Lecture 5 slides)

• Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

• Cons:
  – Graphical models get complex, even ones with only a few cycles…
  – We have to make too many assumptions

• Pros:
  – We do need probability to explain our world, but the joint probability is hard to compute
  – Graphical models can help us analyze and understand our problems
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  – With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution:

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)

• In general:

p(X1, …, Xn) = Π_{i=1}^{n} p(X_i | X_{pa(i)})

• where pa(i) are the parents of node i.

Directed Graphs for Statistical Models: Plate Notation

• A data set of N points generated from a Gaussian: the plate stands for N replicated observation nodes sharing the same parameters.

PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

• For natural-language queries, simple term matching does not work effectively:
  – ambiguous terms
  – the same queries vary due to personal styles

• Latent semantic indexing:
  – creates a "latent semantic space" (hidden meaning)

• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms.

• Disadvantages:
  – the statistical foundation is missing

pLSA – Probabilistic Latent Semantic Analysis

• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve:
  – Polysemy
    • "Java" could mean "coffee" and also the "PL Java"
    • "Cricket" is a "game" and also an "insect"
  – Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA

pLSA

[Plate diagram: for each of M documents d, each of the N_d words w is generated by first sampling a topic z. Unrolled, a document d has word nodes w1, …, wN with corresponding topic nodes z1, …, zN.]

z1, …, zN are variables; z_i ∈ [1, K], where K is the number of latent topics.

pLSA

[Diagram: documents d1, d2, …, dM, each with its own chain of topic nodes z and word nodes w.]

p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents.

Joint Probability vs. Likelihood

• Joint probability: p(d, w) = p(d) Σ_z p(w|z) p(z|d)

• Likelihood (only for observed variables): L = Σ_d Σ_w n(d, w) log p(d, w), where n(d, w) is the count of word w in document d

• p(d) is assumed to be uniform

Document Decomposition

• Each document can be decomposed as p(w|d) = Σ_z p(w|z) p(z|d)

• This is similar to the matrix decomposition, if we consider each discrete distribution as a vector:

p(w|d) = Z_{V×K} p(z|d), where the columns of Z are the topic distributions p(w|z)

• With many documents, we hope to find the latent topics as a common basis.

pLSA – Objective Function

• pLSA tries to maximize the log likelihood:

L = Σ_d Σ_w n(d, w) log Σ_z p(w|z) p(z|d) + const

• Due to the summation over z inside the log, we have to resort to EM.

EM Steps

• E-Step:
  – the expectation of the likelihood function is calculated with the current parameter values
• M-Step:
  – update the parameters with the calculated posterior probabilities
  – find the parameters that maximize the likelihood function

Lower Bounding the Log Likelihood

EM Steps

• The E-Step: compute the posterior of the latent topics,

p(z|d, w) = p(w|z) p(z|d) / Σ_{z'} p(w|z') p(z'|d)

• The M-Step: re-estimate the parameters from the expected counts,

p(w|z) ∝ Σ_d n(d, w) p(z|d, w),  p(z|d) ∝ Σ_w n(d, w) p(z|d, w)
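A sketch of these pLSA EM updates on a toy term-by-document count matrix; the matrix size, K = 3, iteration count, and smoothing epsilon are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n = rng.integers(0, 5, size=(50, 20))          # n(w, d): 50 terms x 20 documents
V, M, K = n.shape[0], n.shape[1], 3

p_w_z = rng.random((V, K)); p_w_z /= p_w_z.sum(axis=0)   # p(w|z)
p_z_d = rng.random((K, M)); p_z_d /= p_z_d.sum(axis=0)   # p(z|d)

for _ in range(100):
    # E step: posterior p(z|d,w) proportional to p(w|z) p(z|d)
    post = p_w_z[:, :, None] * p_z_d[None, :, :]         # shape (V, K, M)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M step: re-estimate p(w|z) and p(z|d) from expected counts
    nz = n[:, None, :] * post                            # expected counts n(w, z, d)
    p_w_z = nz.sum(axis=2); p_w_z /= p_w_z.sum(axis=0)
    p_z_d = nz.sum(axis=0); p_z_d /= p_z_d.sum(axis=0)

loglik = np.sum(n * np.log(p_w_z @ p_z_d + 1e-12))
print(loglik)                                            # increases across iterations
```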

Latent Subspace

pLSA vs. LSA

• LSA and pLSA perform dimensionality reduction:
  – in LSA, by keeping only K singular values
  – in pLSA, by having K aspects

• Comparison to SVD:
  – U matrix: related to P(z|d) (doc to aspect)
  – V matrix: related to P(w|z) (aspect to term)
  – Σ matrix: related to P(z) (aspect strength)

pLSA vs. LSA

• The main difference is the way the approximation is done.

• pLSA generates a model (the aspect model) and maximizes its predictive power.

• Selecting the proper value of K is heuristic in LSA.

• Model selection in statistics can determine the optimal K in pLSA.

Applications

• Text mining: topic discovery

• Scene classification

Text Mining

Scene Classification

Classification Result

Reference

• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.

• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. In Proceedings of the European Conference on Computer Vision (2006).

• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level: each doc has its own topic mixture proportion.

• The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA

• There is no constraint on the distributions p(z|d_i).

• This easily leads to serious problems with over-fitting.

[Diagram: documents d1, d2, …, dm, each with its own chain of topic and word nodes, governed by unconstrained per-document distributions p(z|d1), p(z|d2), …, p(z|dm).]

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.

• Requirements for such a distribution:
  – the samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one, i.e., the samples are multinomial parameters
  – easy to optimize

• The Dirichlet distribution is one such distribution.

• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution

• Definition:

p(x1, …, xK | α1, …, αK) = [Γ(Σ_{i=1}^{K} α_i) / Π_{i=1}^{K} Γ(α_i)] Π_{i=1}^{K} x_i^{α_i − 1},
s.t. x_i ≥ 0, Σ_{i=1}^{K} x_i = 1

• The density is zero outside this open (K − 1)-dimensional simplex.

Example Dirichlet Distributions (K=3)

• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)

Example Dirichlet Distributions (K=3)

• Equal α_i, different α_0 = Σ_{i=1}^{K} α_i: α_0 = 0.1, α_0 = 1, α_0 = 10
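A short sketch of how the concentration α_0 shapes samples on the simplex; the α_0 values mirror the slide, while the seed and sample count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
for alpha0 in (0.1, 1.0, 10.0):
    alpha = np.full(3, alpha0 / 3)          # equal alpha_i, K = 3
    theta = rng.dirichlet(alpha, size=5)    # each row is a point on the 2-simplex
    print(alpha0, theta.round(2))           # rows are non-negative and sum to one
```

Small α_0 pushes samples toward the corners of the simplex (sparse mixtures); large α_0 concentrates them near the center (uniform mixtures).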

The LDA Model

[Graphical model: three documents, each with topic nodes z1…z4 and word nodes w1…w4; per-document topic proportions θ are drawn from a shared Dirichlet(α), and the word distributions β are shared across the corpus.]

• For each document:
• Choose θ ~ Dirichlet(α)
• For each of the N words w_n:
  – Choose a topic z_n ~ Multinomial(θ)
  – Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n

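The generative process above can be sampled directly. A minimal sketch follows; the topic count, vocabulary size, document count, document length, and α value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
K, V, M, N = 3, 10, 4, 15                    # topics, vocab size, docs, words per doc
alpha = np.full(K, 0.5)
beta = rng.dirichlet(np.ones(V), size=K)     # beta[k]: word distribution of topic k

corpus = []
for d in range(M):
    theta = rng.dirichlet(alpha)             # theta ~ Dirichlet(alpha), per document
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)           # z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])         # w_n ~ p(w | z_n, beta)
        words.append(w)
    corpus.append(words)
print(corpus[0])                             # one sampled document (word ids)
```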

Joint Probability

• Given parameters α and β:

p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^{N} p(z_n|θ) p(w_n|z_n, β)

where p(θ|α) is the Dirichlet density.

Likelihood

• Joint probability: as above.

• Marginal distribution of a document:

p(w | α, β) = ∫ p(θ|α) Π_{n=1}^{N} Σ_{z_n} p(z_n|θ) p(w_n|z_n, β) dθ

• Likelihood over all the documents: L = Π_{d=1}^{M} p(w_d | α, β)

Inference

• The likelihood can be computed by summing over each document.
• Jensen's inequality is used, as in EM.

Inference

• In the E-step we need to compute the posterior distribution of the hidden variables:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

• Unfortunately, this distribution is intractable to compute in general.

• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ, φ and minimize the KL divergence between the variational and posterior distributions.

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence.

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs EM

• They differ only in the E-step.

• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) || p(X|D, θ)).

• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.

• Strategy (variational EM):

• Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
• Repeat until convergence:
  – E: maximize L with respect to the variational parameters γ, φ
  – M: maximize the bound with respect to the parameters α and β

Parameter Estimation

• E-Step: variational inference; repeat until convergence.

• M-Step: parameter estimation:

β is updated in closed form from the expected topic-word counts; α can be estimated using the Newton-Raphson method.

Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset: contains 8,000 documents and 15,818 words

• (a) EARN vs. NOT EARN; (b) GRAIN vs. NOT GRAIN

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

[Graphical model: the same LDA plate diagram as above, with a single Dirichlet prior governing all documents' topic proportions.]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information

[Plate diagram: for each of M images, a category-dependent prior over π, with topic nodes z and patch features x for each of the N_d patches, and parameters θ and β.]

Codebook

• 174 local image patches

• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector

• Representation: normalized 11×11 gray values, 128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are You Really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA

• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – need to access 1000 inverted lists
  – the intersection of 1000 inverted lists may be empty
  – the union of 1000 inverted lists may be the whole corpus

• Dimension reduction via topic projection:

        Term1  Term2  Term3  Term4  …  TermN
  Img1  1      2      0      0      …  2         (dim = 1 million)

        → topic projection →

        f1     f2     …      fM
  Img1  0.2    0.1    …      0.03                (dim = 200)

Key Idea: Dimension Reduction + Residual Error Preservation

• p: the original TF-IDF vector in vocabulary space
• X: a projection matrix for dimension reduction
• w: the low-dimensional feature vector
• ε: the residual error

p ≈ Xw + ε

[Figure: an image is represented as a low-dimensional topic vector plus a few residual words (~10 words).]

Orthogonal Decomposition

• With an orthonormal basis X = (x1, x2, …, xk) (so X^T X = I), the decomposition p = Xw + ε gives:
  – base vectors: x1, …, xk
  – low-dimensional representation: w = X^T p
  – residual: ε = p − Xw

[Figure: an image = a low-dimensional topic vector + a few words (10 words).]

• Because ε is orthogonal to the span of X, the inner product of two documents splits exactly:

p^T q = w_p^T w_q + ε_p^T ε_q
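A numeric sketch of this decomposition and of the similarity split under the orthonormal-X assumption; the random data, the QR construction of X, and the dimensions are illustrative, not the paper's actual projection:

```python
import numpy as np

rng = np.random.default_rng(8)
V, k = 1000, 20                                # vocabulary size, reduced dimension
X, _ = np.linalg.qr(rng.normal(size=(V, k)))   # orthonormal basis: X^T X = I

p = rng.random(V)                              # TF-IDF vector of one image/document
w = X.T @ p                                    # low-dimensional representation
eps = p - X @ w                                # residual error (kept as a few words)

q = rng.random(V)
wq, epsq = X.T @ q, q - X @ (X.T @ q)

# The inner product splits exactly because eps is orthogonal to the span of X
print(p @ q, w @ wq + eps @ epsq)              # the two values agree
```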

A Probabilistic Implementation

• x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution: p(x=0|d) Σ_{k=1}^{K} p(w|z=k) p(z=k|d)

• a document-specific distribution: p(x=1|d) p(w|d)

• a background distribution: p(x=2|d) p(w|B)

and p(w|d) is the sum of these three terms.

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.

Search (Online)

• An LSH index over the low-dimensional vectors w maps each bucket to document signatures (DS1, DS2, …).

• A query = a low-dimensional topic vector + a few residual words.

• The LSH index returns candidate documents (e.g., Doc 300, Doc 401, …), which are then re-ranked using the residual words and the document metadata.

• Index: 10M images, 4.6 GB; search speed: < 100 ms.

Search Example

[Query image and retrieved results]

Search Example

[Query image and retrieved results]

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics
• Structure: who replies to whom
• Optimize them together: model the semantics and model the structure jointly

Reply Reconstruction

• Document similarity
• Topic similarity
• Structure similarity

Baselines

• NP: reply to the nearest post
• RR: reply to the root
• DS: document similarity
• LDA: Latent Dirichlet Allocation; project documents to topic space
• SWB: Special Words Topic Model with Background distribution; project documents to topic and junk-topic space

Evaluation

Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding

• Methods: HITS, PageRank, …

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model.
• PageRank: benchmark nodal ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential nodes.

Evaluation

• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF (ori)     0.674   0.362   0.243
EABIF (rec)     0.742   0.318   0.281
PageRank (ori)  0.675   0.377   0.263
PageRank (rec)  0.743   0.321   0.266
HITS (ori)      0.906   0.832   0.900
HITS (rec)      0.938   0.822   0.906

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision:
  – matrix decomposition: a good practice for learning matrices
  – graphical models: a good practice for learning probability

• A graphical model is a good tool for analyzing problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

Page 2: An Introduction To Matrix Decomposition and Graphical Model

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrievalndash Modeling Threaded Discussions

What Is Matrix Decomposition

bull We wish to decompose the matrix A by writing it as a product of two or more matrices

Antimesm = BntimeskCktimesm

bull Suppose A B C are column matricesndash Antimesm = (a1 a2 hellip am) each ai is a n-dim data samplendash Bntimesk = (b1 b2 hellip bk) each bj is a n-dim basis and space B consists of k

basesndash Cktimesm = (c1 c2 hellip cm) each ci is the k-dim coordinates of ai projected to

space B

Why We Need Matrix Decomposition

bull Given one data samplea1 = Bntimeskc1

(a11 a12 hellip a1n)T = (b1 b2 hellip bk) (c11 c12 hellip c1k)T

bull Another data sample a2 = Bntimeskc2

bull More data sample am = Bntimeskcm

bull Together (m data samples) (a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm)

Antimesm = BntimeskCktimesm

Why We Need Matrix Decomposition

(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm

bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space

bull In general B captures the common features in A while C carries specific characteristics of the original samples

bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features

PRINCIPLE COMPONENT ANALYSIS

Definition ndash Eigenvalue amp Eigenvector

Given a m x m matrix C for any λ and w if

Then λ is called eigenvalue and w is called eigenvector

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebook

• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector
• Representation: normalized 11×11 gray values, 128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic Topic Models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes Pachinko Allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA

• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus

• Dimension reduction (topic projection):

         Term1  Term2  Term3  Term4  …  TermN      (Dim = 1 million)
  Img1     1      2      0      0    …    2

            ↓ Topic Projection

          f1    f2    …   fM                        (Dim = 200)
  Img1    0.2   0.1   …  0.03

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error

p = Xw + ε (a code sketch follows below)
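A minimal sketch of the idea, assuming X has orthonormal columns and keeping only the 10 largest residual entries (`reduce_with_residual` is an illustrative helper, not the paper's code):

```python
import numpy as np

def reduce_with_residual(p, X, n_residual_words=10):
    """Decompose p ≈ Xw + sparse residual.

    p: (V,) TF-IDF vector; X: (V, k) basis with orthonormal columns."""
    w = X.T @ p                              # low-dimensional feature vector
    eps = p - X @ w                          # residual error in vocabulary space
    keep = np.argsort(np.abs(eps))[-n_residual_words:]   # keep "a few words"
    sparse_eps = np.zeros_like(eps)
    sparse_eps[keep] = eps[keep]
    return w, sparse_eps
```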

An image ≈ [low-dimensional feature histogram] + a few words (~10 words)

Orthogonal Decomposition

$$p = Xw + \varepsilon = \sum_{i=1}^{k} w_i x_i + \varepsilon, \qquad \varepsilon = p - Xw$$

where the columns x₁, …, x_k of X are the base vectors, w is the low dimensional representation, and ε is the residual.

An image ≈ [low-dimensional feature histogram] + a few words (~10 words)

With orthonormal base vectors X = (x₁, x₂, x₃, …, x_k) and residuals orthogonal to them (Xᵀε = 0), inner products decompose:

$$p^{T} q = (Xw_p + \varepsilon_p)^{T}(Xw_q + \varepsilon_q) = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q$$
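This identity is easy to verify numerically. A sketch (the basis here is a random orthonormal matrix; the residual is not yet sparsified, so the equality is exact):

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 500, 20
X, _ = np.linalg.qr(rng.standard_normal((V, k)))   # orthonormal columns
p, q = rng.random(V), rng.random(V)

wp, ep = X.T @ p, p - X @ (X.T @ p)
wq, eq = X.T @ q, q - X @ (X.T @ q)

# With X^T X = I and X^T eps = 0: p.q = wp.wq + ep.eq exactly.
# Sparsifying the residual (as in the system) makes this approximate.
assert np.isclose(p @ q, wp @ wq + ep @ eq)
```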

A Probabilistic Implementation

x is a switch variable: it controls whether a word is generated from

• a topic-specific distribution,
• a document-specific distribution, or
• a background distribution:

$$p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) + p(x{=}1 \mid d)\, p_{\mathrm{doc}}(w \mid d) + p(x{=}2 \mid d)\, p_{\mathrm{bg}}(w)$$

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
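A sketch of the resulting word probability as code (`word_prob` is an illustrative helper; the component distributions would be learned as in Chemudugunta et al.):

```python
import numpy as np

def word_prob(w, switch_probs, beta, theta_d, doc_dist, background):
    """p(w|d) as a three-way mixture controlled by switch variable x.

    switch_probs: [p(x=0|d), p(x=1|d), p(x=2|d)]
    beta: (K, V) topic-word dists; theta_d: (K,) p(z|d)
    doc_dist: (V,) document-specific dist; background: (V,) corpus dist."""
    topic_term = theta_d @ beta[:, w]        # sum_k p(w|z=k) p(z=k|d)
    return (switch_probs[0] * topic_term
            + switch_probs[1] * doc_dist[w]
            + switch_probs[2] * background[w])
```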

Search (Online)

[Figure: document signatures (DS1, DS2, …) are hashed into an LSH index; each bucket returns candidates such as Doc 300, Doc 401, …]

A query ≈ [low-dimensional feature histogram] + a few words

Re-ranking: LSH candidates (e.g., Doc 300, Doc 401, …) are re-ranked against the stored document metadata to produce the final list (Doc 1 … Doc N).

Index: 10M images, 46 GB. Search speed: < 100 ms.

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009


Semantic & structure

• Semantic: topics
• Structure: who replies to whom
• Optimize them together: model semantics and model structure jointly

Reply reconstruction

• Candidate scoring combines document similarity, topic similarity, and structure similarity

Baselines (a sketch of the DS baseline follows this list):
• NP: reply to nearest post
• RR: reply to root
• DS: document similarity
• LDA: Latent Dirichlet Allocation; project documents to topic space
• SWB: Special Words Topic Model with Background distribution; project documents to topic and junk-topic space
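As referenced above, a sketch of the DS baseline under the assumption that each post replies to the most similar earlier post (`ds_baseline_parents` is an illustrative helper, not the paper's code):

```python
import numpy as np

def ds_baseline_parents(posts):
    """posts: (n_posts, V) TF-IDF rows in thread order; post 0 is the root."""
    parents = [None]                            # the root has no parent
    for i in range(1, len(posts)):
        prev = posts[:i]
        # cosine similarity of post i to every earlier post
        sims = prev @ posts[i] / (
            np.linalg.norm(prev, axis=1) * np.linalg.norm(posts[i]) + 1e-12)
        parents.append(int(np.argmax(sims)))    # reply to the most similar one
    return parents
```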


Evaluation

Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772

Expert finding

Pipeline: reply reconstruction → network construction → expert finding

Methods: HITS, PageRank, … (a minimal HITS sketch follows below)
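As referenced above, a minimal HITS power-iteration sketch over the reconstructed reply network (the adjacency convention is an assumption):

```python
import numpy as np

def hits(L, n_iter=100):
    """Power iteration for HITS. L[i, j] = 1 if user i replies to user j."""
    h = np.ones(L.shape[0])
    a = np.ones(L.shape[0])
    for _ in range(n_iter):
        a = L.T @ h                     # good authorities are pointed to by good hubs
        a /= np.linalg.norm(a)
        h = L @ a                       # good hubs point to good authorities
        h /= np.linalg.norm(h)
    return h, a
```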

Baselines:

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model.
• PageRank: benchmark nodal ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node.

Evaluation

• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: a good practice for learning matrix methods
  – Graphical models: a good practice for learning probability

• A graphical model is a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images

• Graphical modeling is more adaptable to various applications than matrix decomposition

Page 4: An Introduction To Matrix Decomposition and Graphical Model

Why We Need Matrix Decomposition

bull Given one data samplea1 = Bntimeskc1

(a11 a12 hellip a1n)T = (b1 b2 hellip bk) (c11 c12 hellip c1k)T

bull Another data sample a2 = Bntimeskc2

bull More data sample am = Bntimeskcm

bull Together (m data samples) (a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm)

Antimesm = BntimeskCktimesm

Why We Need Matrix Decomposition

(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm

bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space

bull In general B captures the common features in A while C carries specific characteristics of the original samples

bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features

PRINCIPLE COMPONENT ANALYSIS

Definition ndash Eigenvalue amp Eigenvector

Given a m x m matrix C for any λ and w if

Then λ is called eigenvalue and w is called eigenvector

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

• For natural-language queries, simple term matching does not work effectively:
– Ambiguous terms.
– The same queries vary due to personal styles.

• Latent semantic indexing:
– Creates a 'latent semantic space' (hidden meaning).

• LSI puts documents together, even if they don't have common words, if the docs share frequently co-occurring terms.

• Disadvantages:
– The statistical foundation is missing.

pLSA – Probabilistic Latent Semantic Analysis

• Automated document indexing and information retrieval.

• Identification of latent classes using an Expectation-Maximization (EM) algorithm.

• Shown to solve:
– Polysemy:
• "Java" could mean "coffee" and also the programming language Java.
• "Cricket" is a "game" and also an "insect".
– Synonymy:
• "computer", "PC", "desktop" could all mean the same thing.

• Has a better statistical foundation than LSA.

pLSA

[Plate diagram lost in extraction: document d, latent topic z, word w; N_d words per document, M documents. Unrolled form: d → z_1 … z_N → w_1 … w_N.]

• z_1, …, z_N are latent variables, z_i ∈ [1, K], where K is the number of latent topics.

pLSA

[Diagram lost in extraction: documents d_1, d_2, …, d_M, each unrolled as d_i → z_1 … z_{N_i} → w_1 … w_{N_i}.]

• The topic-word distributions p(w|z=1), p(w|z=2), …, p(w|z=K) are shared across all documents.

Likelihood

Joint Probability vs Likelihood

• Joint probability:

• Likelihood (only for observed variables):
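Restoring the two lost formulas (standard pLSA forms, consistent with the surrounding text), with n(d, w) the count of word w in document d:

Joint: p(d, w) = p(d) Σ_z p(w|z) p(z|d)

Likelihood: L = Σ_d Σ_w n(d, w) log p(d, w)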

• p(d) is assumed to be uniform.

Document Decomposition

• Each document can be decomposed as p(w|d) = Σ_z p(w|z) p(z|d).

• This is similar to matrix decomposition, if we consider each discrete distribution as a vector:

p(w|d) = Z_{V×k} · p(z|d)

• With many documents, we hope to find the latent topics as a common basis.

pLSA – Objective Function

• pLSA tries to maximize the log likelihood L above.

• Due to the summation over z inside the log, we have to resort to EM.

EM Steps

• E-Step:
– The expectation of the likelihood function is calculated with the current parameter values.

• M-Step:
– Update the parameters with the calculated posterior probabilities.
– Find the parameters that maximize the likelihood function.

Lower Bounding the Log Likelihood

EM Steps

• The E-Step and the M-Step — the slide's update equations are restored below:
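Restoring the standard pLSA updates (the slide's equations were images):

E-step: p(z|d, w) = p(w|z) p(z|d) / Σ_z' p(w|z') p(z'|d)

M-step: p(w|z) ∝ Σ_d n(d, w) p(z|d, w);  p(z|d) ∝ Σ_w n(d, w) p(z|d, w)

A minimal numpy sketch of these updates for a V×M term-document count matrix N (all names illustrative):

```python
import numpy as np

def plsa(N, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    V, M = N.shape
    p_w_z = rng.random((V, K)); p_w_z /= p_w_z.sum(axis=0)   # p(w|z)
    p_z_d = rng.random((K, M)); p_z_d /= p_z_d.sum(axis=0)   # p(z|d)
    for _ in range(n_iter):
        # E-step: posterior p(z|d,w) for every (w, d) pair, shape (V, K, M)
        post = p_w_z[:, :, None] * p_z_d[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: expected counts n(d,w) p(z|d,w), then renormalize
        nz = N[:, None, :] * post
        p_w_z = nz.sum(axis=2); p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = nz.sum(axis=0); p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d
```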

Latent Subspace

pLSA vs. LSA

• LSA and pLSA both perform dimensionality reduction:
– In LSA, by keeping only K singular values.
– In pLSA, by having K aspects.

• Comparison to SVD:
– U matrix: related to p(z|d) (doc to aspect).
– V matrix: related to p(w|z) (aspect to term).
– Σ matrix: related to p(z) (aspect strength).

pLSA vs. LSA

• The main difference is the way the approximation is done.

• pLSA generates a model (the aspect model) and maximizes its predictive power.

• Selecting the proper value of K is heuristic in LSA.

• Model selection in statistics can determine the optimal K in pLSA.

Applications

• Text mining: topic discovery.

• Scene classification.

Text Mining

Scene Classification

Classification Result

Reference

• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.

• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).

• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level; each document has its own topic mixture proportion.

• The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA

• There is no constraint on the distributions p(z|d_i).

• It is easy to run into serious over-fitting problems.

[Diagram lost in extraction: documents d_1, d_2, …, d_m, each with its own unconstrained mixture p(z|d_1), p(z|d_2), …, p(z|d_m).]

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.

• Requirements for such a distribution:
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials.
– Easy to optimize.

• The Dirichlet distribution is one such distribution.

• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution

• Definition:

p(x_1, x_2, …, x_K | α_1, α_2, …, α_K) = [Γ(Σ_i α_i) / Π_i Γ(α_i)] · Π_i x_i^(α_i − 1),

where x_i ≥ 0 and Σ_i x_i = 1.

• The density is zero outside this open (K − 1)-dimensional simplex.

Example Dirichlet Distributions (K=3)

• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6). [Density plots lost in extraction.]

• Equal α_i, different α_0 = Σ_i α_i: α_0 = 0.1, α_0 = 1, α_0 = 10. [Density plots lost in extraction.]
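A small numpy illustration (not from the slides) of how α_0 controls concentration — small α_0 pushes samples toward the simplex corners, large α_0 toward its center:

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in ([0.1, 0.1, 0.1], [1.0, 1.0, 1.0], [10.0, 10.0, 10.0]):
    s = rng.dirichlet(alpha, size=5)   # 5 samples on the 2-simplex
    print(alpha, s.round(2))           # each row is non-negative and sums to 1
```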

The LDA Model

[Plate/graphical diagram lost in extraction: α → θ → z_n → w_n ← β, repeated for each document.]

• For each document:
• Choose θ ~ Dirichlet(α).
• For each of the N words w_n:
– Choose a topic z_n ~ Multinomial(θ).
– Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
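A hedged sketch of this generative process (α of length K and β of shape K×V are assumed given; β[k] is topic k's word distribution):

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    theta = rng.dirichlet(alpha)                  # theta ~ Dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # z_n ~ Multinomial(theta)
        w = rng.choice(beta.shape[1], p=beta[z])  # w_n ~ p(w | z_n, beta)
        words.append(w)
    return words

rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(20), size=3)         # 3 topics over a 20-word vocabulary
print(generate_document(np.array([0.5, 0.5, 0.5]), beta, 10, rng))
```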


Joint Probability

• Given parameters α and β, the joint probability of a topic mixture θ, topics z, and words w is

p(θ, z, w | α, β) = p(θ|α) Π_{n=1..N} p(z_n|θ) p(w_n|z_n, β),

where p(z_n|θ) is simply θ_i for the unique i such that z_n = i.

Likelihood

• Joint probability: as above.

• Marginal distribution of a document: p(w|α, β) = ∫ p(θ|α) (Π_{n=1..N} Σ_{z_n} p(z_n|θ) p(w_n|z_n, β)) dθ.

• Likelihood over all the documents: L(α, β) = Π_{d=1..M} p(w_d | α, β).

Inference

• The likelihood can be computed by summing over each document.

• Jensen's inequality in EM.

Inference

• In the E-step, we need to compute the posterior distribution of the hidden variables.

• Unfortunately, this distribution is intractable to compute in general.

• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions.

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence:

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.
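Restoring the identity behind this slide (in the notation of Blei et al., where q(θ, z | γ, φ) is the variational distribution):

log p(w|α, β) = L(γ, φ; α, β) + KL(q(θ, z | γ, φ) ‖ p(θ, z | w, α, β))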

VBEM vs EM

• VBEM differs from EM only in the E-step.

• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0.

• In VBEM, it is intractable to compute p(X|D, θ). Instead, we approximate p(X|D, θ) by a variational distribution q(X), obtained by minimizing KL(q(X) ‖ p(X|D, θ)).

• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.

• Strategy (variational EM):

• Lower-bound log p(w|α, β) by a function L(γ, φ; α, β), and repeat until convergence:
– E: maximize L with respect to the variational parameters γ and φ.
– M: maximize the bound with respect to the model parameters α and β.

Parameter Estimation

• E-Step: variational inference – repeat until convergence.

• M-Step: parameter estimation:
– β: closed-form update.
– α: can be implemented using the Newton–Raphson method.
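The updates themselves (as derived in Blei et al. 2003; Ψ is the digamma function):

E-step: φ_ni ∝ β_{i,w_n} exp(Ψ(γ_i)),  γ_i = α_i + Σ_n φ_ni

M-step: β_ij ∝ Σ_d Σ_n φ*_{dni} w_dn^j,  with α updated by Newton–Raphson.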

Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus. [Topic-word tables lost in extraction.]

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset – contains 8,000 documents and 15,818 words.

[Accuracy plots lost in extraction: (a) EARN vs. NOT EARN; (b) GRAIN vs. NOT GRAIN.]

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

[LDA graphical diagram repeated; lost in extraction.]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information.

[Plate diagram lost in extraction: for each of M images, a category-conditioned mixture π → topic z → patch feature x, with parameters θ and β; N_d patches per image.]

Codebook

• 174 local image patches.

• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector.

• Representation: normalized 11×11 gray values, 128-dim SIFT.

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic Topic Models, ICML 2006.
• Correlated Topic Model, NIPS 2005.
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003.
• Nonparametric Bayes Pachinko Allocation, UAI 2007.
• Supervised LDA, NIPS 2007.
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009.
• …

Are You Really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.

• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA

• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
– Need to access 1000 inverted lists.
– The intersection of 1000 inverted lists may be empty.
– The union of 1000 inverted lists may be the whole corpus.

• Dimension reduction:

        Term1  Term2  Term3  Term4  …  TermN     (dim = 1 million)
  Img1    1      2      0      0    …    2
                  | topic projection
                  v
        f1     f2     …      fM                  (dim = 200)
  Img1   0.2    0.1    …     0.03

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space.
• X: projection matrix for dimension reduction.
• w: low-dimensional feature vector.
• ε: residual error.

p = Xw + ε

[Figure lost in extraction: an image ≈ a dense low-dimensional vector plus a sparse residual of a few words (~10 words).]

Orthogonal Decomposition

p = Xw + ε, where X = (x_1, x_2, …, x_k) holds the base vectors, w is the low-dimensional representation, and ε is the residual, orthogonal to the subspace spanned by X.

[Figure lost in extraction: an image ≈ a low-dimensional vector plus a few residual words (~10 words).]

Because the residual is orthogonal to the subspace, inner products are preserved:

p^T q ≈ w_p^T w_q + ε_p^T ε_q
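A minimal numpy sketch of this encode step (X is assumed to be an orthonormal V×k basis; all names illustrative):

```python
import numpy as np

def encode(p, X, n_residual_words=10):
    w = X.T @ p                          # low-dimensional representation
    eps = p - X @ w                      # residual in vocabulary space
    keep = np.argsort(np.abs(eps))[-n_residual_words:]
    sparse_eps = np.zeros_like(eps)
    sparse_eps[keep] = eps[keep]         # preserve only a few residual words
    return w, sparse_eps

# With orthonormal X, p.T @ q is approximated by w_p @ w_q + eps_p @ eps_q.
```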

A Probabilistic Implementation

x is a switch variable: it controls whether a word is generated from

• a topic-specific distribution,
• a document-specific distribution, or
• a background distribution:

p(w|d) = p(x=0|d) Σ_{k=1..K} p(w|z=k) p(z=k|d) + p(x=1|d) p(w|d, specific) + p(x=2|d) p(w|background)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.

Search (Online)

[Figure lost in extraction: online search pipeline. A query image is mapped to a low-dimensional vector plus a few residual words; an LSH index over document signatures (DS1, DS2, …) returns candidates (e.g., Doc 300, Doc 401, …), which are then re-ranked using the document metadata.]

• Index: 10M images, 46 GB. Search speed: < 100 ms.

Search Example

[Figure lost in extraction: query image and retrieved results.]

Search Example (2)

[Figure lost in extraction: query image and retrieved results.]

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics.

• Structure: who replies to whom.

• Optimize them together: model the semantics and the structure jointly.

Reply Reconstruction

[Figure lost in extraction: candidate reply links scored by document similarity, topic similarity, and structure similarity.]

Baselines:

• NP: reply to the nearest post.
• RR: reply to the root.
• DS: document similarity.
• LDA: Latent Dirichlet Allocation; project documents to topic space.
• SWB: Special Words Topic Model with Background distribution; project documents to topic and junk-topic space.

Evaluation

method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding.

• Ranking methods: HITS, PageRank, …

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert-finding task using a language model.

• PageRank: benchmark nodal ranking method.

• HITS: finds hub nodes and authority nodes.

• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential nodes.

Evaluation

• Bayesian estimate.

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision.
– Matrix decomposition – a good practice for learning matrices.
– Graphical models – a good practice for learning probability.

• A graphical model is a good tool to analyze and understand problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.


  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 6: An Introduction To Matrix Decomposition and Graphical Model

PRINCIPLE COMPONENT ANALYSIS

Definition ndash Eigenvalue amp Eigenvector

Given a m x m matrix C for any λ and w if

Then λ is called eigenvalue and w is called eigenvector

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outline
• Basic concepts
  – Likelihood, i.i.d.
  – ML, MAP and Bayesian inference
  – Expectation-Maximization
  – Mixture of Gaussians, parameter estimation

• pLSA
  – Motivation
  – Derivation & geometry properties
  – Applications

• LDA
  – Motivation: why add a hyperparameter
  – Dirichlet distribution
  – Variational EM
  – Relations with other topic models
  – Incorporating category information

• Summary

Not Included

• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning?

Data
• Let x = (x1, x2, …, xD)^T denote a data point, and D = {x(1), x(2), …, x(N)} a data set. D is sometimes associated with desired outputs y1, y2, ….

Predictions
• We are generally interested in predicting something based on the observed data set.
• Given D, what can we say about x(N+1)?

Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ.
• Given data D, we learn the model parameters, from which we can predict new data points.
• The model can often be expressed as a probability distribution over data points.

Likelihood Function

• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data.

• Inversely, given the observed data and a model of interest, the likelihood function is defined as

L(θ) = f_θ(x) = p(x|θ)

• That is, the likelihood function L(θ) shows that some parameter values are more likely to have produced the data.

Maximum Likelihood (ML)

• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated by the model.

• Suppose we are given n data samples (x1, x2, …, xn); the likelihood is L(θ) = p(x1, …, xn | θ).

• Maximum likelihood finds the θ that maximizes L(θ): θ* = argmax_θ L(θ).

• Predictive distribution: predict a new point x using p(x | θ*).

IID – Independent, Identically Distributed

• IID means the samples are drawn independently from the same distribution:
  p(x1, …, xn | θ) = Πᵢ p(xᵢ | θ)

• The problem is considerably simplified, since the likelihood factorizes over samples.

• Usually the log likelihood is used:
  l(θ) = log L(θ) = Σᵢ log p(xᵢ | θ)
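As a quick worked example (not from the slides): for iid Gaussian samples, maximizing the log likelihood gives the familiar closed-form estimates.

```python
import numpy as np

x = np.random.normal(loc=2.0, scale=1.5, size=1000)   # toy iid data
mu_ml = x.mean()                    # argmax over mu of sum_i log N(x_i | mu, var)
var_ml = ((x - mu_ml) ** 2).mean()  # ML variance (divides by n, hence biased)
```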

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Introduction to Machine Learning, Lectures 1-2 slides)

• Gregor Heinrich. Parameter estimation for text analysis. Technical note, 2005-2008.

EXPECTATION MAXIMIZATION

Why We Need EM

• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models.

• Why do we need latent variables?

• To describe complex models: e.g., the Gaussian Mixture Model.

• To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA.

More General

• Data set D = {y1, …, yN}, with latent variables X; likelihood p(D|θ) = ∫ p(D, X|θ) dX.

• Goal: learn maximum likelihood (ML) parameter values.

• The maximum likelihood procedure finds parameters θ such that θ* = argmax_θ p(D|θ).

• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated, hard-to-optimize function.

The Expectation Maximization (EM) Algorithm

• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:

• E step: fill in values of the latent variables according to their posterior given the data.

• M step: maximize the likelihood as if the latent variables were not hidden.

• This decomposes difficult problems into a series of tractable steps.
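A minimal numpy sketch of this alternation for a two-component 1-D Gaussian mixture (the data and initialization are toy assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 700)])
pi, mu, sd = np.array([0.5, 0.5]), np.array([1.0, 4.0]), np.array([1.0, 1.0])

def gauss(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

for _ in range(100):
    # E step: responsibility of each component for each point (posterior of z)
    r = pi * gauss(x[:, None], mu, sd)           # shape (n, 2)
    r /= r.sum(axis=1, keepdims=True)
    # M step: maximize the likelihood as if the responsibilities were observed
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
```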

Jensen's Inequality

For a concave function f (such as log), f(E[x]) ≥ E[f(x)].

Lower Bounding the Log Likelihood
• Observed data D = {yn}; latent variables X = {xn}; parameters θ.
• Goal: maximize the log likelihood (i.e., ML learning) w.r.t. θ:
  L(θ) = log p(D|θ) = log ∫ p(D, X|θ) dX

• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
  L(θ) = log ∫ q(X) [p(D, X|θ)/q(X)] dX ≥ ∫ q(X) log [p(D, X|θ)/q(X)] dX = F(q, θ)

• Equivalently, F(q, θ) = ∫ q(X) log p(D, X|θ) dX + H[q], where H[q] is the entropy of q(X).

The E and M Steps of EM

• The lower bound on the log likelihood is the F(q, θ) defined above.

• EM alternates between:
• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed:
  q^(k)(X) = argmax_q F(q, θ^(k−1))

• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed:
  θ^(k) = argmax_θ F(q^(k), θ)

The E Step

• E step: for fixed θ,
  F(q, θ) = L(θ) − KL(q(X) || p(X|D, θ))

• The second term is the Kullback-Leibler divergence.
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when KL(q(X) || p(X|D, θ)) = 0.
• So the E step simply sets q^(k)(X) = p(X|D, θ^(k−1)).

The M Step

• M step: maximize F(q, θ) w.r.t. the parameters, holding the hidden distribution q fixed:
  θ^(k) = argmax_θ ∫ q^(k)(X) log p(D, X|θ) dX

• The second equality (dropping H[q]) comes from the fact that the entropy of q(X) does not depend directly on θ.

• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically.

EM Never Decreases the Likelihood

• The E and M steps together never decrease the log likelihood:
  L(θ^(k−1)) = F(q^(k), θ^(k−1)) ≤ F(q^(k), θ^(k)) ≤ L(θ^(k))

• The E step brings F(q, θ) up to the likelihood L(θ).
• The M step raises F(q, θ) by optimizing over θ.
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL.

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Unsupervised learning, Lecture 5 slides)

• Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

WHY DO WE NEED GRAPHICAL MODELS

Why Do We Need Graphical Models

• Cons:
  – A graphical model can become complex, even with only a few cycles.
  – We have to make too many assumptions.

• Pros:
  – We do need probability to explain our world, but the joint probability is hard to compute.
  – Graphical models can help us analyze and understand our problems.
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables.
  – With a graphical model, we can decouple a joint probability into conditional probabilities, which are usually easier.

Directed Acyclic Graphical Models (Bayesian Networks)

• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution:

p(A, B, C, D, E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)

• In general:

p(x1, …, xn) = Πᵢ p(xᵢ | x_pa(i))

• where pa(i) are the parents of node i.
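The payoff of the factorization is parameter count: five binary variables need 2^5 − 1 numbers jointly, but only the local conditionals under the DAG. A minimal sketch for the network above, with illustrative (made-up) CPT values:

```python
import itertools

pA = {0: 0.6, 1: 0.4}
pB = {0: 0.7, 1: 0.3}
pC_AB = {(a, b): {0: 0.5, 1: 0.5} for a in (0, 1) for b in (0, 1)}  # toy CPTs
pD_BC = {(b, c): {0: 0.8, 1: 0.2} for b in (0, 1) for c in (0, 1)}
pE_CD = {(c, d): {0: 0.9, 1: 0.1} for c in (0, 1) for d in (0, 1)}

def joint(a, b, c, d, e):
    # p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
    return pA[a] * pB[b] * pC_AB[(a, b)][c] * pD_BC[(b, c)][d] * pE_CD[(c, d)][e]

# sanity check: the factorized joint sums to 1 over all 32 assignments
assert abs(sum(joint(*v) for v in itertools.product((0, 1), repeat=5)) - 1) < 1e-9
```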

Directed Graphs for Statistical Models: Plate Notation

• Example: a data set of N points generated from a Gaussian. The plate (a box labeled N) stands for N repetitions of the nodes inside it.

PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

• For natural language queries, simple term matching does not work effectively:
  – Terms are ambiguous.
  – The same queries vary due to personal styles.

• Latent semantic indexing:
  – Creates a 'latent semantic space' (hidden meaning).

• LSI puts documents together, even if they don't have common words, if the docs share frequently co-occurring terms.

• Disadvantages:
  – The statistical foundation is missing.

pLSA – Probabilistic Latent Semantic Analysis

• Automated document indexing and information retrieval.
• Identification of latent classes using an Expectation-Maximization (EM) algorithm.
• Shown to solve:
  – Polysemy
    • "Java" could mean "coffee" and also the "programming language Java".
    • "Cricket" is a "game" and also an "insect".
  – Synonymy
    • "computer", "pc", "desktop" all could mean the same.

• Has a better statistical foundation than LSA.

pLSA

[Figure: pLSA in plate notation. For each of M documents d, each of its Nd words w is generated by first choosing a topic z from p(z|d) and then a word from p(w|z). Unrolled, a document d generates topic variables z1, …, zN, which generate words w1, …, wN.]

z1, …, zN are variables, with zi ∈ [1, K]; K is the number of latent topics.

pLSA

[Figure: pLSA unrolled over documents d1, d2, …, dM. Each document di has its own Ni topic variables z and words w, drawn from its own mixture p(z|di).]

The word distributions p(w|z=1), p(w|z=2), …, p(w|z=K) are shared across all documents.

Likelihood

Joint Probability vs Likelihood

• Joint probability: p(d, w) = p(d) Σ_z p(w|z) p(z|d)

• Likelihood (only over observed variables): L = Σ_d Σ_w n(d, w) log p(d, w), where n(d, w) is the count of word w in document d.

• p(d) is assumed to be uniform.

Document Decomposition

• Each document can be decomposed as a mixture of topics:

p(w|d) = Σ_z p(w|z) p(z|d)

• This is similar to matrix decomposition, if we consider each discrete distribution as a vector:

p(w|d) = Z_{V×k} p(z|d)

• With many documents, we hope to find the latent topics as a common basis.

pLSA – Objective Function

• pLSA tries to maximize the log likelihood:

L = Σ_d Σ_w n(d, w) log p(d, w)

• Due to the summation over z inside the log, we have to resort to EM.

EM Steps

• E-Step:
  – The expectation of the likelihood function is calculated with the current parameter values.
• M-Step:
  – Update the parameters with the calculated posterior probabilities.
  – Find the parameters that maximize the likelihood function.

Lower Bounding the Log Likelihood

EM Steps

• The E-Step: compute the posterior of the topic assignments,
  p(z|d, w) = p(w|z) p(z|d) / Σ_z' p(w|z') p(z'|d)

• The M-Step: re-estimate the parameters from the expected counts,
  p(w|z) ∝ Σ_d n(d, w) p(z|d, w),  p(z|d) ∝ Σ_w n(d, w) p(z|d, w)
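These two steps fit in a few lines of numpy. The sketch below is an illustration (not the original authors' code) on a toy count matrix; shapes and initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(0, 5, size=(20, 100)).astype(float)   # toy counts n(d, w)
K = 5
p_z_d = rng.random((20, K));  p_z_d /= p_z_d.sum(1, keepdims=True)   # p(z|d)
p_w_z = rng.random((K, 100)); p_w_z /= p_w_z.sum(1, keepdims=True)   # p(w|z)

for _ in range(100):
    # E step: p(z|d,w) ∝ p(w|z) p(z|d)
    post = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (d, z, w)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M step: re-estimate p(w|z) and p(z|d) from expected counts
    nz = N[:, None, :] * post                           # n(d,w) p(z|d,w)
    p_w_z = nz.sum(axis=0); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = nz.sum(axis=2); p_z_d /= p_z_d.sum(1, keepdims=True)
```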

Latent Subspace

pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction:
  – In LSA, by keeping only K singular values.
  – In pLSA, by having K aspects.

• Comparison to SVD:
  – The U matrix is related to P(z|d) (doc to aspect).
  – The V matrix is related to P(w|z) (aspect to term).
  – The Σ matrix is related to P(z) (aspect strength).

pLSA vs LSA
• The main difference is the way the approximation is done.

• pLSA generates a model (the aspect model) and maximizes its predictive power.

• Selecting the proper value of K is heuristic in LSA.

• Model selection in statistics can determine the optimal K in pLSA.

Applications

• Text mining: topic discovery.

• Scene classification.

Text Mining

Scene Classification

Classification Result

Reference

• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.

• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).

• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level: each doc has its own topic mixture proportion.

• The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA

• There is no constraint on the distributions p(z|di); each document's mixture is a free parameter.

• This easily leads to serious problems with over-fitting.

[Figure: pLSA unrolled over documents d1, d2, …, dm; each document carries its own unconstrained, independently estimated mixture p(z|d1), p(z|d2), …, p(z|dm).]

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.

• Requirements for such a distribution:
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomial parameter vectors.
  – Easy to optimize.

• The Dirichlet distribution is one such distribution.

• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution

• Definition:

p(x1, …, xK | α1, …, αK) = [Γ(Σᵢ αᵢ) / Πᵢ Γ(αᵢ)] Πᵢ xᵢ^(αᵢ−1),
subject to xᵢ ≥ 0 and Σᵢ xᵢ = 1.

• The density is zero outside this open (K − 1)-dimensional simplex.

• Various parameter settings α:

α = (6, 2, 2)   α = (3, 7, 5)
α = (2, 3, 4)   α = (6, 2, 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

• Equal αᵢ, different α0 = Σᵢ αᵢ:

α0 = 0.1    α0 = 1    α0 = 10
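The effect of α can be checked by sampling; a minimal numpy sketch (the concentration values echo the figures above):

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in ([6, 2, 2], [3, 7, 5], [0.1, 0.1, 0.1], [10, 10, 10]):
    theta = rng.dirichlet(alpha, size=5)   # each row sums to 1: a point on the simplex
    print(alpha, theta.round(2))
```

Small α0 pushes samples toward the corners of the simplex (sparse mixtures); large α0 concentrates them near the mean.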

The LDA Model

[Figure: LDA in plate notation: α → θ → zn → wn ← β, with the (zn, wn) pairs repeated for the N words of each of the M documents.]

• For each document:
  • Choose θ ~ Dirichlet(α).
  • For each of the N words wn:
    – Choose a topic zn ~ Multinomial(θ).
    – Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn.

The LDA Model

• For each document:
  • Choose θ ~ Dirichlet(α).
  • For each of the N words wn:
    – Choose a topic zn ~ Multinomial(θ).
    – Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn.
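This generative process is short enough to execute directly. A minimal sketch for one document, with a toy vocabulary and illustrative α and β (assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 10, 20                       # topics, vocabulary size, words per doc
alpha = np.full(K, 0.5)
beta = rng.dirichlet(np.ones(V), size=K)  # beta[k] = p(w | z=k), one row per topic

theta = rng.dirichlet(alpha)              # theta ~ Dirichlet(alpha)
doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)            # z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])          # w_n ~ p(w | z_n, beta)
    doc.append(w)
```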

Joint Probability

• Given parameters α and β:

p(θ, z, w | α, β) = p(θ|α) Πₙ p(zₙ|θ) p(wₙ|zₙ, β)

where p(zₙ|θ) is simply θᵢ for the unique i such that zₙ = i.

Likelihood

• Joint probability: p(θ, z, w | α, β) as above.

• Marginal distribution of a document:
  p(w | α, β) = ∫ p(θ|α) Πₙ Σ_{zₙ} p(zₙ|θ) p(wₙ|zₙ, β) dθ

• Likelihood over all the documents: L = Π_d p(w_d | α, β)

Inference

• The likelihood can be computed by summing over each document.
• Jensen's inequality is applied, as in EM.

Inference

• In the E-Step, we need to compute the posterior distribution of the hidden variables, p(θ, z | w, α, β).

• Unfortunately, this distribution is intractable to compute in general.

• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions.

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence.

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs EM

• They differ only in the E-Step.

• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, we approximate p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) || p(X|D, θ)).

• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.

• Strategy (Variational EM):

• Lower bound log p(w | α, β) by a function L(γ, φ; α, β), then repeat until convergence:
  – E: maximize L with respect to the variational parameters γ and φ.
  – M: maximize the bound with respect to the parameters α and β.

Parameter Estimation

• E-Step: variational inference; repeat until convergence:
  φ_{n,k} ∝ β_{k,w_n} exp(Ψ(γ_k)),  γ_k = α_k + Σₙ φ_{n,k}

• M-Step: parameter estimation:
  β_{k,w} ∝ Σ_d Σₙ φ_{d,n,k} 1[w_{d,n} = w]

α can be updated using the Newton-Raphson method.

Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus.

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset: contains 8,000 documents and 15,818 words.

(a) EARN vs NOT EARN    (b) GRAIN vs NOT GRAIN

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

[Figure: LDA plate diagram repeated; all documents share a single Dirichlet prior over their topic mixtures.]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information.

[Figure: plate notation of the model: a per-image mixture π over themes z generating patch features x, with parameters θ and β, repeated over the Nd patches of each of the M images.]

Codebook
• 174 local image patches.

• Detection: evenly sampled grid, random sampling, saliency detector, or Lowe's DoG detector.

• Representation: normalized 11×11 gray values, or 128-dim SIFT.

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006.
• Correlated Topic Model, NIPS 2005.
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003.
• Nonparametric Bayes pachinko allocation, UAI 2007.
• Supervised LDA, NIPS 2007.
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009.
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA

• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – We need to access 1000 inverted lists.
  – The intersection of 1000 inverted lists may be empty.
  – The union of 1000 inverted lists may be the whole corpus.

• Dimension reduction maps the sparse term vector to a short topic vector:

  Term1 Term2 Term3 Term4 … TermN            f1   f2  …  fM
  Img1:   1     2     0     0  …   2    →    Img1: 0.2  0.1 … 0.03

  (Topic projection: dim = 1 million → dim = 200)

Key Idea: Dimension Reduction + Residual Error Preservation

p = Xw + ξ

• p: original TF-IDF vector in vocabulary space.
• X: projection matrix for dimension reduction.
• w: low-dimensional feature vector.
• ξ: residual error.

[Figure: an image is represented as a dense low-dimensional vector plus a few residual words (~10 words).]

Orthogonal Decomposition

• Choose the base vectors X = (x1, x2, …, xk) to be orthonormal. Then the decomposition p = Xw + ξ has

w = X^T p   (low-dimensional representation)
ξ = p − Xw   (residual)

• X: base vectors; w: low-dimensional representation; ξ: residual.

[Figure: an image = a dense low-dimensional vector + a few residual words (~10 words).]

• With orthonormal X = (x1, x2, …, xk), similarity is preserved:

p^T q = (Xw_p + ξ_p)^T (Xw_q + ξ_q) = w_p^T w_q + ξ_p^T ξ_q
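A minimal numpy sketch of this decomposition (the random orthonormal basis, vocabulary size, and TF-IDF vector are illustrative assumptions, not the paper's learned basis):

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 10000, 200                                   # vocabulary size, reduced dim
X, _ = np.linalg.qr(rng.standard_normal((V, k)))    # orthonormal basis (assumption)
p = rng.random(V); p /= np.linalg.norm(p)           # a toy TF-IDF vector

w = X.T @ p                              # low-dimensional representation
resid = p - X @ w                        # residual error
top = np.argsort(-np.abs(resid))[:10]    # keep only ~10 residual "words"
xi = np.zeros(V); xi[top] = resid[top]

q_w, q_xi = w, xi                        # suppose a query decomposed the same way
sim = w @ q_w + xi @ q_xi                # p^T q ≈ w_p^T w_q + ξ_p^T ξ_q
```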

A Probabilistic Implementation

x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution: p(x=0|d) Σ_{k=1}^{K} p(w|z=k) p(z=k|d)

• a document-specific distribution: p(x=1|d) p(w|d)

• a background distribution: p(x=2|d) p(w)

and p(w|d) is the sum of these three terms.

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.

Search (Online)

[Figure: online search pipeline. A query = a dense low-dimensional vector + a few residual words. The query's low-dimensional feature is hashed into an LSH index; each bucket stores document signatures (DS1, DS2, …), the matching bucket returns a candidate list (e.g., Doc 300, Doc 401, …), and the candidates are re-ranked with the residual-corrected similarity against the document metadata to produce the final ranking.]

Index: 10M images, 4.6 GB. Search speed: < 100 ms.
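A hypothetical sketch of such an online stage. Random-hyperplane LSH stands in for whatever hash family the paper actually uses, and all names, sizes, and features below are illustrative assumptions (the paper's re-ranking would also add the residual term shown earlier).

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
k, bits = 200, 16
planes = rng.standard_normal((bits, k))            # random hyperplanes

def lsh_key(w):
    return tuple(planes @ w > 0)                   # 16-bit signature of w

# offline: bucket every document's low-dimensional feature
docs = {i: rng.standard_normal(k) for i in range(1000)}   # toy features
index = defaultdict(list)
for i, w in docs.items():
    index[lsh_key(w)].append(i)

def search(w_q, topn=10):
    cand = index[lsh_key(w_q)]                     # candidates from one bucket
    ranked = sorted(cand, key=lambda i: -(docs[i] @ w_q))  # re-rank by similarity
    return ranked[:topn]
```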

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics.
• Structure: who replies to whom.
• Optimize them together: model the semantics and the structure jointly.

Reply Reconstruction

• Candidate replies are scored by combining document similarity, topic similarity, and structure similarity.

Baselines:
• NP: reply to the nearest post.
• RR: reply to the root.
• DS: document similarity.
• LDA: Latent Dirichlet Allocation; project documents to topic space.
• SWB: Special Words Topic Model with Background distribution; project documents to topic and junk-topic space.

Evaluation

Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding.

• Methods: HITS, PageRank, …

Baselines

• LM: "Formal Models for Expert Finding in Enterprise Corpora", SIGIR '06. Achieves stable performance on the expert finding task using a language model.
• PageRank: benchmark node-ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: "Personalized Recommendation Driven by Information Flow", SIGIR '06. Finds the most influential nodes.

Evaluation

• Bayesian estimate.

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906

Summary

• Matrix methods and probability are fundamental mathematics in information retrieval and computer vision:
  – Matrix decomposition is a good practice for learning matrix methods.
  – Graphical models are a good practice for learning probability.

• A graphical model is a good tool for analyzing problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• A graphical model is more adaptable to various applications than matrix decomposition.

Page 7: An Introduction To Matrix Decomposition and Graphical Model

Definition ndash Eigenvalue amp Eigenvector

Given a m x m matrix C for any λ and w if

Then λ is called eigenvalue and w is called eigenvector

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 8: An Introduction To Matrix Decomposition and Graphical Model

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCA

• We can write A = UΣV^T; remember that in PCA we treat A as a row matrix
• V holds the eigenvectors of A^T A:
  – each column in V is an eigenvector of the row matrix A
  – we use V to approximate a row in A
• Equivalently, we can write A^T = VΣU^T; U holds the eigenvectors of AA^T:
  – each column in U is an eigenvector of the column matrix A
  – we use U to approximate a column in A

Example – LSI

• Build a term-by-document matrix A
• Compute the SVD of A: A = UΣV^T
• Approximate A by A_k = U_k D_k, where D_k = Σ_k V_k^T:
  – U_k: orthogonal basis that we use to approximate all the documents
  – D_k: column j holds the coordinates of document j in the new basis
  – D_k is the projection of A onto the subspace spanned by U_k

SVD and PCA

• For symmetric A, SVD is closely related to PCA
• PCA: A = UΛU^T
  – U and Λ are the eigenvectors and eigenvalues
• SVD: A = UΛV^T
  – U holds the left (column) eigenvectors
  – V holds the right (row) eigenvectors
  – Λ holds the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD:
  – SVD: A is directly the data, e.g., a term-by-document matrix
  – PCA: A is a covariance matrix, A = X^T X, where each row in X is a sample

Latent Semantic Indexing (LSI)

1. Document file preparation / preprocessing:
   – indexing: collecting terms
   – use a stop list: eliminate "meaningless" words
   – stemming
2. Construction of the term-by-document matrix; sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback

Latent Semantic Indexing

• Assumption: there is some underlying latent semantic structure in the data
• E.g., "car" and "automobile" occur in similar documents, as do "cows" and "sheep"
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD

Similarity Measures

• Term to term: AA^T = UΣ²U^T = (UΣ)(UΣ)^T
  UΣ gives the coordinates of A (rows) projected into space V
• Document to document: A^T A = VΣ²V^T = (VΣ)(VΣ)^T
  VΣ gives the coordinates of A (columns) projected into space U

Similarity Measures

• Term to document: A = UΣV^T = (UΣ^(1/2))(VΣ^(1/2))^T
  UΣ^(1/2) gives the coordinates of A (rows) projected into space V; VΣ^(1/2) gives the coordinates of A (columns) projected into space U
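
A short sketch of the document-to-document measure under these definitions; the helper name is illustrative, and a dense term-by-document matrix is assumed:

    import numpy as np

    def lsi_similarities(A, k):
        """Document-document cosine similarities in the rank-k latent space."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        D = Vt[:k, :].T * s[:k]     # document j -> row j of V_k Sigma_k
        D /= np.linalg.norm(D, axis=1, keepdims=True)
        return D @ D.T              # cosine similarity matrix between documents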

HITS (Hyperlink-Induced Topic Search)

• Idea: the Web includes two flavors of prominent pages:
  – authorities contain high-quality information
  – hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities

(Figure: hub pages on the left linking to authority pages on the right.)

Power Iteration

• Each page i has both a hub score hi and an authority score ai
• HITS successively refines these scores by computing a_i = Σ_{j: j→i} h_j and h_i = Σ_{j: i→j} a_j
• Define the adjacency matrix L of the directed web graph: L_ij = 1 if page i links to page j
• Now a = L^T h and h = L a

HITS and SVD

• In L, rows are outlinks and columns are inlinks
• a will be the dominant eigenvector of the authority matrix L^T L
• h will be the dominant eigenvector of the hub matrix L L^T
• h and a are, in fact, the first left and right singular vectors of L
• We are, in fact, running SVD on the adjacency matrix
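
A compact power-iteration sketch of these updates (illustrative; assumes a dense 0/1 adjacency matrix):

    import numpy as np

    def hits(L, iters=100):
        """HITS scores; L[i, j] = 1 iff page i links to page j."""
        h = np.ones(L.shape[0])
        a = np.ones(L.shape[0])
        for _ in range(iters):
            a = L.T @ h; a /= np.linalg.norm(a)   # authority: sum of pointing hubs
            h = L @ a;  h /= np.linalg.norm(h)    # hub: sum of pointed-to authorities
        return h, a                               # converge to top singular vectors of L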

HITS vs. PageRank

• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random-jump step

NMF – NON-NEGATIVE MATRIX FACTORIZATION

Definition

• Given a non-negative matrix V (n × m), find non-negative matrix factors W (n × k) and H (k × m) such that V ≈ WH
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column represents one basis
• H: coordinates of V projected onto W, i.e., v_j ≈ W h_j

Motivation

• Non-negativity is natural in many applications
• Probability is also non-negative
• An additive model captures local structure

Multiplicative Update Algorithm

• Cost function: Euclidean distance ‖V − WH‖²
• Multiplicative update:
  H ← H ⊙ (W^T V) ⊘ (W^T W H)
  W ← W ⊙ (V H^T) ⊘ (W H H^T)
  (⊙ and ⊘ denote element-wise multiplication and division)

Multiplicative Update Algorithm

• Cost function: divergence D(A‖B) = Σ_ij (A_ij log(A_ij / B_ij) − A_ij + B_ij)
  – reduces to the Kullback–Leibler divergence when Σ_ij A_ij = Σ_ij B_ij = 1
  – A and B can then be regarded as normalized probability distributions
• Multiplicative update rules analogous to the Euclidean case (Lee & Seung)
• pLSA is NMF with the KL divergence
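
A sketch of the Euclidean-cost updates above; the random initialization and fixed iteration count are illustrative choices:

    import numpy as np

    def nmf(V, k, iters=200, eps=1e-9):
        """Lee-Seung multiplicative updates for the cost ||V - WH||_F^2."""
        rng = np.random.default_rng(0)
        W = rng.random((V.shape[0], k))
        H = rng.random((k, V.shape[1]))
        for _ in range(iters):
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # eps guards against division by zero
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # updates keep W, H non-negative
        return W, H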

NMF vs. PCA

• n = 2429 faces; m = 19 × 19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representation

Reference

• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).

Major Reference

• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki. (Highly recommended)

Outline

• Basic concepts
  – likelihood, i.i.d.
  – ML, MAP, and Bayesian inference
  – Expectation-Maximization
  – mixture of Gaussians: parameter estimation
• pLSA
  – motivation
  – derivation & geometry properties
  – applications
• LDA
  – motivation – why add a hyperparameter?
  – Dirichlet distribution
  – variational EM
  – relations with other topic models
  – incorporating category information
• Summary

Not Included

• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning?

Data
• Let x = (x1, x2, …, xD)^T denote a data point, and D = {x(1), x(2), …, x(N)} a data set. D is sometimes associated with desired outputs y1, y2, …

Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x(N+1)?

Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points

Likelihood Function

• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
  L(θ) = f_θ(x) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameter values are more likely to have produced the data

Maximum Likelihood (ML)

• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from the model
• Suppose we are given n data samples (x1, x2, …, xn); then L(θ) = p(x1, x2, …, xn | θ)
• Maximum likelihood finds θ_ML = argmax_θ L(θ)
• Predictive distribution: p(x | θ_ML)

IID – Independent, Identically Distributed

• IID means p(x1, x2, …, xn | θ) = Π_{i=1}^n p(xi | θ)
• The problem is considerably simplified: L(θ) = Π_i p(xi | θ)
• Usually the log-likelihood is used: ℓ(θ) = log L(θ) = Σ_i log p(xi | θ)
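
A small sketch making this concrete for a 1-D Gaussian (function names are illustrative):

    import numpy as np

    def gaussian_log_likelihood(x, mu, var):
        """Log-likelihood of an i.i.d. sample x under N(mu, var): sum of log densities."""
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

    def gaussian_ml(x):
        """Closed-form ML estimates: the maximizers of the log-likelihood above."""
        mu = x.mean()
        var = ((x - mu) ** 2).mean()   # note: the ML variance divides by n, not n-1
        return mu, var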

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Introduction to Machine Learning, Lectures 1–2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical note, 2005–2008.

EXPECTATION MAXIMIZATION

Why We Need EM

• The Expectation–Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  • to describe complex models: the Gaussian mixture model
  • to discover the intrinsic structure inside a data set: topic models such as pLSA and LDA

More General

• Data set: D = {y_n}; likelihood: p(D|θ) = Σ_X p(D, X|θ), where X are the latent variables
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that θ* = argmax_θ p(D|θ)
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated and hard-to-optimize function

The Expectation Maximization (EM) Algorithm

• The EM algorithm finds a (local) maximum of a latent variable model's likelihood. It starts from arbitrary values of the parameters and iterates two steps:
• E step: fill in the values of the latent variables according to their posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• EM decomposes difficult problems into a series of tractable steps

Jensen's Inequality

• For a concave function such as log: log E[x] ≥ E[log x]

Lower Bounding the Log Likelihood

• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e., ML learning) w.r.t. θ: L(θ) = log p(D|θ)
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
  L(θ) = log Σ_X p(D, X|θ) ≥ Σ_X q(X) log [p(D, X|θ) / q(X)] = F(q, θ)
• F(q, θ) = Σ_X q(X) log p(D, X|θ) + H[q], where H[q] is the entropy of q(X)

The E and M Steps of EM

• The lower bound on the log likelihood is F(q, θ)
• EM alternates between:
• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed: q^(k)(X) = argmax_q F(q, θ^(k−1))
• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed: θ^(k) = argmax_θ F(q^(k)(X), θ)

The E Step

• E step for fixed θ: F(q, θ) = log p(D|θ) − KL[q(X) ‖ p(X|D, θ)]
• The second term is the Kullback–Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL[q(X) ‖ p(X|D, θ)] = 0
• So the E step simply sets q^(k)(X) = p(X|D, θ^(k−1))

The M Step

• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed:
  θ^(k) = argmax_θ F(q^(k), θ) = argmax_θ Σ_X q^(k)(X) log p(D, X|θ)
• The second equality holds because the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
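
To make the two steps concrete, here is a hedged sketch of EM for a 1-D Gaussian mixture; the initialization choices are illustrative:

    import numpy as np

    def em_gmm(x, K, iters=100):
        """EM for a 1-D Gaussian mixture: the E and M steps made concrete."""
        rng = np.random.default_rng(0)
        pi = np.full(K, 1.0 / K)                    # mixing proportions
        mu = rng.choice(x, size=K, replace=False)   # initialize means at data points
        var = np.full(K, x.var())
        for _ in range(iters):
            # E step: responsibilities q(z_i = k) = p(z_i = k | x_i, theta)
            log_r = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                     - (x[:, None] - mu) ** 2 / (2 * var))
            r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
            r /= r.sum(axis=1, keepdims=True)
            # M step: analytic maximizers of the expected complete-data log-likelihood
            Nk = r.sum(axis=0) + 1e-12
            pi, mu = Nk / len(x), (r * x[:, None]).sum(axis=0) / Nk
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        return pi, mu, var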

EM Never Decreases the Likelihood

• The E and M steps together never decrease the log likelihood:
  L(θ^(k−1)) = F(q^(k), θ^(k−1)) ≤ F(q^(k), θ^(k)) ≤ L(θ^(k))
• The E step brings F(q, θ) up to touch the likelihood L(θ)
• The M step lifts F(q, θ) by maximizing it w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen – or, equivalently, from the non-negativity of KL

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.

WHY DO WE NEED GRAPHICAL MODELS

Why Do We Need Graphical Models?

• Cons:
  – a graphical model can become complex, even with only a few cycles
  – we have to make too many assumptions
• Pros:
  – we do need probability to explain our world, but the joint probability is hard to compute
  – graphical models can help us analyze and understand our problems
  – graphs are an intuitive way of representing and visualizing the relationships between many variables
  – with a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution:
  p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general:
  p(X1, …, XN) = Π_i p(Xi | X_pa(i))
• where pa(i) are the parents of node i
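
A toy sketch of this decoupling for the five-variable example; all conditional probability table values below are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical CPTs for five binary variables, normalized over the child's axis
    pA = np.array([0.6, 0.4])
    pB = np.array([0.7, 0.3])
    pC = rng.random((2, 2, 2)); pC /= pC.sum(axis=2, keepdims=True)   # p(C|A,B)
    pD = rng.random((2, 2, 2)); pD /= pD.sum(axis=2, keepdims=True)   # p(D|B,C)
    pE = rng.random((2, 2, 2)); pE /= pE.sum(axis=2, keepdims=True)   # p(E|C,D)

    def joint(a, b, c, d, e):
        """p(A,B,C,D,E) decoupled into the product of local conditionals."""
        return pA[a] * pB[b] * pC[a, b, c] * pD[b, c, d] * pE[c, d, e]

    # sanity check: the factorized joint sums to one over all 2^5 assignments
    total = sum(joint(a, b, c, d, e) for a in (0, 1) for b in (0, 1)
                for c in (0, 1) for d in (0, 1) for e in (0, 1))
    assert np.isclose(total, 1.0)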

Directed Graphs for Statistical Models: Plate Notation

• A data set of N points generated from a Gaussian: rather than drawing N separate nodes, plate notation draws a single node inside a box (plate) labeled N, meaning that node is replicated N times

PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI): Review

• For natural language queries, simple term matching does not work effectively:
  – terms are ambiguous
  – the same queries vary due to personal styles
• Latent semantic indexing:
  – creates a 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, as long as the docs share frequently co-occurring terms
• Disadvantage:
  – the statistical foundation is missing

pLSA – Probabilistic Latent Semantic Analysis

• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve:
  – polysemy:
    • "Java" could mean "coffee" or the programming language Java
    • "cricket" is a game and also an insect
  – synonymy:
    • "computer", "PC", and "desktop" could all mean the same thing
• Has a better statistical foundation than LSA

pLSA

(Figure: the pLSA graphical model in plate notation – a document node d and, inside a word plate of size Nd nested in a document plate of size M, a latent topic node z generating a word node w; the equivalent unrolled model has nodes z1, …, zN and w1, …, wN.)

z1, …, zN are variables, with zi ∈ [1, K]; K is the number of latent topics

pLSA

(Figure: the pLSA model unrolled over documents d1, d2, …, dM; document di has its own topic variables z1, …, zNi and words w1, …, wNi.)

The distributions p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents

Likelihood

• The pLSA log-likelihood is L = Σ_d Σ_w n(d, w) log p(d, w)

Joint Probability vs. Likelihood

• Joint probability: p(d, w) = p(d) Σ_z p(w|z) p(z|d)
• Likelihood (only over the observed variables): L = Σ_d Σ_w n(d, w) log p(d, w), where n(d, w) is the count of word w in document d
• p(d) is assumed to be uniform

Document Decomposition

• Each document can be decomposed as p(w|d) = Σ_z p(w|z) p(z|d)
• This is similar to matrix decomposition, if we consider each discrete distribution as a vector:
  p(w|d) = Z_{V×k} p(z|d), where the columns of Z are the topic distributions p(w|z)
• With many documents, we hope to find the latent topics as a common basis

pLSA – Objective Function

• pLSA tries to maximize the log likelihood:
  L = Σ_d Σ_w n(d, w) log Σ_z p(w|z) p(z|d) + const
• Due to the summation over z inside the log, we have to resort to EM

EM Steps

• E-step:
  – the expectation of the likelihood function is calculated with the current parameter values
• M-step:
  – update the parameters with the calculated posterior probabilities
  – find the parameters that maximize the likelihood function

Lower Bounding the Log Likelihood

EM Steps

• The E-step computes the topic posterior with the current parameters:
  p(z|d, w) = p(w|z) p(z|d) / Σ_{z'} p(w|z') p(z'|d)
• The M-step re-estimates the parameters:
  p(w|z) ∝ Σ_d n(d, w) p(z|d, w),   p(z|d) ∝ Σ_w n(d, w) p(z|d, w)
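
Below is a compact, hedged sketch of these update equations; the dense arrays, fixed iteration count, and random initialization are illustrative simplifications:

    import numpy as np

    def plsa(N, K, iters=50, eps=1e-12):
        """pLSA trained by EM; N[d, w] holds the word counts n(d, w)."""
        D, W = N.shape
        rng = np.random.default_rng(0)
        p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)   # p(w|z)
        p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)   # p(z|d)
        for _ in range(iters):
            # E-step: posterior p(z|d,w) for every (d, w) pair, shape (D, W, K)
            post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
            post /= post.sum(axis=2, keepdims=True) + eps
            # M-step: re-estimate both distributions from expected counts
            expected = N[:, :, None] * post          # n(d,w) * p(z|d,w)
            p_w_z = expected.sum(axis=0).T
            p_w_z /= p_w_z.sum(1, keepdims=True) + eps
            p_z_d = expected.sum(axis=1)
            p_z_d /= p_z_d.sum(1, keepdims=True) + eps
        return p_w_z, p_z_d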

Latent Subspace

pLSA vs. LSA

• LSA and pLSA both perform dimensionality reduction:
  – in LSA, by keeping only K singular values
  – in pLSA, by having K aspects
• Comparison to SVD:
  – the U matrix is related to P(z|d) (document to aspect)
  – the V matrix is related to P(w|z) (aspect to term)
  – the Σ matrix is related to P(z) (aspect strength)

pLSA vs. LSA

• The main difference is the way the approximation is done
• pLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA

Applications

• Text mining: topic discovery
• Scene classification

Text Mining

(Figure: example topics discovered from a text corpus.)

Scene Classification

(Figure: example scene categories classified via pLSA.)

Classification Result

(Figure: classification performance of the pLSA-based approach.)

Reference

• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• Bosch, A., Zisserman, A., and Munoz, X. Scene Classification via pLSA. In Proceedings of the European Conference on Computer Vision (2006).
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A., and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting problems

(Figure: the unrolled pLSA model again – each document di carries its own free parameter p(z|di).)

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution:
  – the samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomial parameter vectors
  – it is easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex

Dirichlet Distribution

• Definition:
  p(x1, x2, …, xK | α) = [Γ(Σ_{i=1}^K α_i) / Π_{i=1}^K Γ(α_i)] Π_{i=1}^K x_i^(α_i − 1),
  subject to x_i ≥ 0 and Σ_{i=1}^K x_i = 1
• The density is zero outside this open (K − 1)-dimensional simplex

Example Dirichlet Distributions (K=3)

• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)

Example Dirichlet Distributions (K=3)

• Equal αi, different α0 = Σ_{i=1}^K αi:
  α0 = 0.1, α0 = 1, α0 = 10
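
A quick sketch of drawing such mixture proportions with NumPy; the α values echo the examples above:

    import numpy as np

    rng = np.random.default_rng(0)
    # Each draw is a K-tuple of non-negative numbers summing to one.
    # Small alpha_0 (e.g. 0.1 per component) pushes samples toward the corners
    # of the simplex; large alpha_0 concentrates them near the center.
    for alpha in [(6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6), (0.1, 0.1, 0.1)]:
        theta = rng.dirichlet(alpha)
        print(alpha, theta.round(3), theta.sum())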

The LDA Model

(Figure: the LDA graphical model – corpus-level parameters α and β, a per-document topic proportion θ, and per-word topic and word nodes z1, …, z4 and w1, …, w4.)

• For each document:
  • choose θ ~ Dirichlet(α)
  • for each of the N words wn:
    – choose a topic zn ~ Multinomial(θ)
    – choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn

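
A hedged sketch of this generative process with toy dimensions; β would normally be learned, but here it is simply passed in:

    import numpy as np

    def lda_generate(alpha, beta, n_docs, doc_len):
        """Sample a toy corpus from the LDA generative process.
        alpha: (K,) Dirichlet prior; beta: (K, V) topic-word distributions."""
        rng = np.random.default_rng(0)
        K, V = beta.shape
        corpus = []
        for _ in range(n_docs):
            theta = rng.dirichlet(alpha)                   # theta ~ Dirichlet(alpha)
            z = rng.choice(K, size=doc_len, p=theta)       # z_n ~ Multinomial(theta)
            words = [rng.choice(V, p=beta[k]) for k in z]  # w_n ~ p(w | z_n, beta)
            corpus.append(words)
        return corpus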

Joint Probability

• Given parameters α and β:
  p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^N p(zn|θ) p(wn|zn, β)
• where p(zn|θ) is simply θ_i for the topic i selected by zn

Likelihood

• Joint probability: p(θ, z, w | α, β)
• Marginal distribution of a document:
  p(w | α, β) = ∫ p(θ|α) Π_{n=1}^N Σ_{zn} p(zn|θ) p(wn|zn, β) dθ
• Likelihood over all the documents: L = Π_{d=1}^M p(w_d | α, β)

Inference

• The likelihood can be computed by summing over each document
• Jensen's inequality is used, as in EM

Inference

• In the E-step, we need to compute the posterior distribution of the hidden variables:
  p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence:
  log p(w | α, β) = L(γ, φ; α, β) + KL(q(θ, z | γ, φ) ‖ p(θ, z | w, α, β))
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence

VBEM vs. EM

• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0
• In VBEM, it is intractable to compute p(X|D, θ); instead, we approximate p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM):
  • lower bound log p(w | α, β) by a function L(γ, φ; α, β)
  • repeat until convergence:
    – E: maximize L with respect to the variational parameters γ and φ
    – M: maximize the bound with respect to the parameters α and β

Parameter Estimation

• E-step: variational inference – repeat until convergence:
  φ_ni ∝ β_{i,wn} exp(Ψ(γ_i)),   γ_i = α_i + Σ_n φ_ni
• M-step: parameter estimation:
  β_ij ∝ Σ_d Σ_n φ*_{dni} w_{dn}^j
  α can be updated using the Newton–Raphson method

Topic Examples in a 100-Topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus

Classification (50-Topic LDA + SVM)

• Reuters-21578 dataset: contains 8,000 documents and 15,818 words

(Figure: classification accuracy for (a) EARN vs. NOT EARN and (b) GRAIN vs. NOT GRAIN.)

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong

(Figure: the LDA graphical model repeated for reference.)

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information

(Figure: the plate-notation model – per-image mixture π, topic z, and observed patch feature x, with parameters θ and β, over Nd patches in each of M images.)

Codebook

• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector, or Lowe's DoG detector
• Representation: normalized 11×11 gray values, or 128-dim SIFT

Topic Distribution in Different Categories

(Figure: per-category topic distributions.)

Topic Hierarchical Clustering

(Figure: hierarchical clustering of the learned topics.)

More Topic Models

• Dynamic topic models, ICML 2006
• Correlated topic model, NIPS 2005
• Hierarchical Dirichlet process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – maximum margin discriminant LDA, ICML 2009
• …

Are You Really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix decomposition
  – PCA, SVD, NMF
  – LDA, ICA, sparse coding, etc.
• Graphical models
  – basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two applications
  – document decomposition for "long query" retrieval, ICCV 2009
  – modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – we need to access 1000 inverted lists
  – the intersection of 1000 inverted lists may be empty
  – the union of 1000 inverted lists may be the whole corpus
• Dimension reduction via topic projection:

  before (dim = 1 million):  Term1  Term2  Term3  Term4  …  TermN
                      Img1:    1      2      0      0    …    2
  after (dim = 200):           f1     f2     …      fM
                      Img1:   0.2    0.1     …     0.03

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
  p ≈ Xw + ε

(Figure: an image is represented as a low-dimensional topic histogram plus a few residual words – about 10 words.)

Orthogonal Decomposition

• Decompose p against the orthonormal basis X = (x1, x2, …, xk):
  p = Xw + ε = (x1, x2, …, xk)(w1, w2, …, wk)^T + ε, with wi = xi^T p
  – base vectors: x1, …, xk
  – low-dimensional representation: w
  – residual: ε = p − Xw

(Figure: a second image, again a topic histogram plus a few residual words – about 10 words.)

• Because the basis X1, X2, X3, …, Xk is orthonormal and the residual is orthogonal to it, the inner product is preserved:
  p^T q ≈ w_p^T w_q + ε_p^T ε_q
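
A small sketch verifying this preservation property on random data; the orthonormal basis here comes from a QR factorization and all dimensions are illustrative:

    import numpy as np

    def decompose(p, X):
        """Split p into a low-dim code w and a residual; X has orthonormal columns."""
        w = X.T @ p          # low-dimensional representation, w_i = x_i^T p
        eps = p - X @ w      # residual, orthogonal to span(X)
        return w, eps

    rng = np.random.default_rng(0)
    X, _ = np.linalg.qr(rng.standard_normal((1000, 20)))   # orthonormal 20-dim basis
    p, q = rng.random(1000), rng.random(1000)
    w_p, e_p = decompose(p, X)
    w_q, e_q = decompose(q, X)
    assert np.isclose(p @ q, w_p @ w_q + e_p @ e_q)        # similarity preserved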

A Probabilistic Implementation

• x is a switch variable: it controls whether a word is generated from
  • a topic-specific distribution: p(x = 0 | d) Σ_{k=1}^K p(w | z = k) p(z = k | d)
  • a document-specific distribution: p(x = 1 | d) p(w | d)
  • a background distribution: p(x = 2 | d) p(w)
  and p(w|d) is the sum of these three terms
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.

Search (Online)

(Figure: the online pipeline – the query is hashed into an LSH index whose buckets hold document signatures DS1, DS2, …, yielding a candidate list.)

(Figure: a query – again a topic histogram plus a few words – retrieves candidates such as Doc 300 and Doc 401 from the LSH index; the candidates are then re-ranked against the stored document metadata to produce the final ranked list.)

Index: 10M images, 46 GB; search speed: < 100 ms

Search Example

(Figure: a query image and its retrieved results.)

Search Example

(Figure: a second query image and its retrieved results.)

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantics & Structure

• Semantics: topics
• Structure: who replies to whom
• Optimize them together: model the semantics and the structure jointly

Reply Reconstruction

• Combines three similarity signals:
  – document similarity
  – topic similarity
  – structure similarity

Baselines

• NP: reply to the nearest post
• RR: reply to the root
• DS: document similarity
• LDA: latent Dirichlet allocation; project documents into the topic space
• SWB: special words topic model with a background distribution; project documents into the topic and junk-topic space

Evaluation

Method   Slashdot               Apple
         All Posts  Good Posts  All Posts  Good Posts
NP       0.021      0.012       0.289      0.239
RR       0.183      0.319       0.269      0.474
DS       0.463      0.643       0.409      0.628
LDA      0.465      0.644       0.410      0.648
SWB      0.463      0.644       0.410      0.641
SMSS     0.524      0.737       0.517      0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding
• Ranking methods: HITS, PageRank, …

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance on the expert-finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential nodes

Evaluation

• Bayesian estimate

Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision:
  – matrix decomposition is a good practice for learning matrices
  – graphical models are a good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition

Page 9: An Introduction To Matrix Decomposition and Graphical Model

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239



Optimization Problem (cont.)

• Eigenvalue equation: Cw = λw, where C = A^T A.
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal components already found.

Property: Data Decomposition

• PCA can be treated as data decomposition:

  a = U U^T a
    = (u1, u2, …, un)(u1, u2, …, un)^T a
    = (u1, u2, …, un)(⟨u1, a⟩, ⟨u2, a⟩, …, ⟨un, a⟩)^T
    = (u1, u2, …, un)(b1, b2, …, bn)^T
    = Σi bi ui
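To make the decomposition concrete, here is a minimal NumPy sketch (random data and variable names are illustrative) that reconstructs a centered sample from its coordinates in the eigenbasis:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))    # rows are data samples
X = X - X.mean(axis=0)               # center each column

C = X.T @ X                          # covariance matrix (up to a 1/n factor)
eigvals, U = np.linalg.eigh(C)       # columns of U: orthonormal eigenvectors

a = X[0]                             # one centered sample
b = U.T @ a                          # b_i = <u_i, a>
a_rec = U @ b                        # a = U U^T a = sum_i b_i u_i
assert np.allclose(a, a_rec)
```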

Face Recognition – Eigenface

• Turk, M.A., Pentland, A.P. Face recognition using eigenfaces. CVPR 1991. (Citations: 2654.)
• The eigenface approach:
  – images are points in a vector space;
  – use PCA to reduce dimensionality;
  – this yields a "face space";
  – compare projections onto the face space to recognize faces.

PageRank – Power Iteration

• Column j has nonzero elements in the positions corresponding to the outlinks of page j (Nj in total).
• Row i has nonzero elements in the positions corresponding to the inlinks Ii of page i.

Column-Stochastic & Irreducible

• Column-stochastic: every column of the link matrix sums to one, i.e., each column is a probability distribution over outlinks.
• Irreducible: every page is reachable from every other page; in practice this is enforced by adding a small uniform "random jump" term to the matrix.

Iterative PageRank Calculation

• For k = 1, 2, …: r(k) = A r(k−1), normalizing r(k) at each step.
• Equivalently, solve A r = r (λ = 1, since A is a Markov chain transition matrix).
• Why can we use power iteration to find the first eigenvector?

Convergence of the Power Iteration

• Expand the initial approximation r0 in terms of the eigenvectors: r0 = c1u1 + c2u2 + … + cnun. Each multiplication by A scales the i-th component by λi, so after k steps the components other than the dominant eigenvector u1 decay like (λi/λ1)^k.
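As a rough illustration, here is a small power-iteration sketch for PageRank; the damping factor alpha and the tiny 3-page web are made-up examples, not part of the original slides:

```python
import numpy as np

def pagerank(A: np.ndarray, alpha: float = 0.85, iters: int = 100) -> np.ndarray:
    # A is column-stochastic; the random-jump term makes it irreducible.
    n = A.shape[0]
    G = alpha * A + (1 - alpha) * np.ones((n, n)) / n
    r = np.ones(n) / n
    for _ in range(iters):
        r = G @ r
        r /= r.sum()          # keep r a probability distribution
    return r

# Hypothetical 3-page web: page 0 -> 1, page 1 -> 2, page 2 -> 0 and 1.
A = np.array([[0, 0, 0.5],
              [1, 0, 0.5],
              [0, 1, 0  ]])
print(pagerank(A))
```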

SINGULAR VALUE DECOMPOSITION

SVD – Definition

• Any m × n matrix A, with m ≥ n, can be factorized as
• A = U Σ V^T, where U (m × n) and V (n × n) have orthonormal columns and Σ is diagonal.

Singular Values and Singular Vectors

• The diagonal elements σj of Σ are the singular values of the matrix A.
• The columns of U and V are the left singular vectors and right singular vectors, respectively.
• Equivalent form of SVD: A vj = σj uj (and A^T uj = σj vj).

Matrix Approximation

• Theorem: let Uk = (u1, u2, …, uk), Vk = (v1, v2, …, vk) and Σk = diag(σ1, σ2, …, σk), and define Ak = Uk Σk Vk^T.
• Then ‖A − Ak‖2 = σk+1, the smallest error achievable by any rank-k matrix (Eckart–Young).
• It means that the best approximation of rank k for the matrix A is Ak.
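A short NumPy sketch of this theorem on a toy matrix (illustrative only): the spectral-norm error of the rank-k truncation equals σk+1:

```python
import numpy as np

A = np.random.randn(8, 5)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # A_k = U_k Σ_k V_k^T

# Eckart–Young: the spectral-norm error equals the (k+1)-th singular value.
err = np.linalg.norm(A - A_k, 2)
assert np.isclose(err, s[k])
```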

SVD and PCA

• We can write A = U Σ V^T.
• Remember that in PCA we treat A as a row matrix (each row is a sample).
• V is just the eigenvectors of A^T A:
  – each column in V is an eigenvector of the row matrix A;
  – we use V to approximate a row in A.
• Equivalently, we can write A^T = V Σ U^T; U is just the eigenvectors of A A^T:
  – each column in U is an eigenvector of the column matrix A;
  – we use U to approximate a column in A.

Example – LSI

• Build a term-by-document matrix A.
• Compute the SVD of A: A = U Σ V^T.
• Approximate A by Ak = Uk Dk:
  – Uk: the orthogonal basis that we use to approximate all the documents;
  – Dk: column j holds the coordinates of document j in the new basis;
  – Dk is the projection of A onto the subspace spanned by Uk.

SVD and PCA

• For symmetric A, SVD is closely related to PCA:
  – PCA: A = U Λ U^T, where U and Λ hold the eigenvectors and eigenvalues;
  – SVD: A = U Λ V^T, where U holds the left (column) eigenvectors, V the right (row) eigenvectors, and Λ the same eigenvalues;
  – for symmetric A, the column eigenvectors equal the row eigenvectors.
• Note the difference of A in PCA and SVD:
  – SVD: A is directly the data, e.g., a term-by-document matrix;
  – PCA: A is a covariance matrix, A = X^T X, where each row in X is a sample.

Latent Semantic Indexing (LSI)

1. Document file preparation / preprocessing:
   – indexing: collecting terms;
   – use a stop list: eliminate "meaningless" words;
   – stemming.
2. Construction: term-by-document matrix, sparse matrix storage.
3. Query matching: distance measures.
4. Data compression by low-rank approximation: SVD.
5. Ranking and relevance feedback.

Latent Semantic Indexing

• Assumption: there is some underlying latent semantic structure in the data.
• E.g., "car" and "automobile" occur in similar documents, as do "cows" and "sheep".
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD.

Similarity Measures

• Term to term: A A^T = U Σ² U^T = (UΣ)(UΣ)^T; the rows of UΣ are the coordinates of A's rows projected onto the space V.
• Document to document: A^T A = V Σ² V^T = (VΣ)(VΣ)^T; the rows of VΣ are the coordinates of A's columns projected onto the space U.

Similarity Measures

• Term to document: A = U Σ V^T = (UΣ^½)(VΣ^½)^T; UΣ^½ gives the coordinates of A's rows projected onto the space V, and VΣ^½ gives the coordinates of A's columns projected onto the space U.
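A small sketch of how these coordinates are used in practice, on a made-up term-by-document matrix: document-to-document similarity measured in the k-dimensional latent space:

```python
import numpy as np

# Toy term-by-document matrix (terms x docs); values are counts.
A = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
D = (np.diag(s[:k]) @ Vt[:k, :]).T    # rows: documents as V_k Σ_k coordinates

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(D[0], D[1]), cos(D[0], D[2]))   # doc-doc similarities
```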

HITS (Hyperlink Induced Topic Search)

• Idea: the Web includes two flavors of prominent pages:
  – authorities contain high-quality information;
  – hubs are comprehensive lists of links to authorities.
• A page is a good authority if many hubs point to it.
• A page is a good hub if it points to many authorities.
• Good authorities are pointed to by good hubs, and good hubs point to good authorities.

[Figure: a bipartite sketch of hubs pointing to authorities.]

Power Iteration

• Each page i has both a hub score hi and an authority score ai.
• HITS successively refines these scores by computing ai from the hub scores of pages linking to i, and hi from the authority scores of pages i links to (normalizing after each round).
• Define the adjacency matrix L of the directed web graph: Lij = 1 if page i links to page j.
• Now a ← L^T h and h ← L a, so a ← L^T L a and h ← L L^T h.
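A compact sketch of the HITS iteration on a hypothetical 3-page adjacency matrix (the normalization and iteration count are illustrative choices):

```python
import numpy as np

def hits(L: np.ndarray, iters: int = 50):
    # L[i, j] = 1 if page i links to page j.
    n = L.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = L.T @ h; a /= np.linalg.norm(a)   # authorities gather from hubs
        h = L @ a;  h /= np.linalg.norm(h)    # hubs gather from authorities
    return h, a

L = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
h, a = hits(L)
print("hubs:", h, "authorities:", a)
```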

HITS and SVD

• In L, rows are outlinks and columns are inlinks.
• a will be the dominant eigenvector of the authority matrix L^T L; h will be the dominant eigenvector of the hub matrix L L^T.
• h and a are in fact the first left and right singular vectors of L.
• We are in fact running SVD on the adjacency matrix.

HITS vs. PageRank

• PageRank may be computed once; HITS is computed per query.
• HITS takes the query into account; PageRank doesn't.
• PageRank has no concept of hubs.
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot.
• PageRank is more stable, because of its random-jump step.

NMF – NON-NEGATIVE MATRIX FACTORIZATION

Definition

• Given a non-negative matrix Vn×m, find non-negative matrix factors Wn×k and Hk×m such that Vn×m ≈ Wn×k Hk×m.
• V: a column matrix; each column is a data sample (n-dimensional).
• W: each of its k columns is a basis vector.
• H: the coordinates of V projected onto W, i.e., vj ≈ Wn×k hj.

Motivation

• Non-negativity is natural in many applications.
• Probability is also non-negative.
• It is an additive model that captures local structure.

Multiplicative Update Algorithm

• Cost function: Euclidean distance ‖V − WH‖².
• Multiplicative update (Lee & Seung):
  H ← H ⊙ (W^T V) ⊘ (W^T W H),
  W ← W ⊙ (V H^T) ⊘ (W H H^T),
  where ⊙ and ⊘ denote element-wise multiplication and division.
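A minimal sketch of these multiplicative updates (random toy data; the eps term avoids division by zero and is an implementation convenience, not part of the original update rules):

```python
import numpy as np

def nmf(V: np.ndarray, k: int, iters: int = 200, eps: float = 1e-9):
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update coordinates
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update bases
    return W, H

V = np.abs(np.random.randn(20, 10))
W, H = nmf(V, k=3)
print(np.linalg.norm(V - W @ H))   # reconstruction error
```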

Multiplicative Update Algorithm

• Cost function: divergence D(A‖B) = Σij (Aij log(Aij/Bij) − Aij + Bij).
  – It reduces to the Kullback–Leibler divergence when Σij Aij = Σij Bij = 1, i.e., when
  – A and B can be regarded as normalized probability distributions.
• The multiplicative update has the same flavor as in the Euclidean case.
• pLSA is NMF with KL divergence.

NMF vs. PCA

• n = 2429 faces; m = 19×19 pixels.
• Positive values are illustrated with black pixels and negative values with red pixels.
• NMF learns a parts-based representation; PCA learns holistic representations.

Reference

• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).

Major Reference

• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki. (Highly recommended.)

Outline

• Basic concepts
  – Likelihood, i.i.d.
  – ML, MAP, and Bayesian inference
  – Expectation-Maximization
  – Mixture of Gaussians, parameter estimation
• pLSA
  – Motivation
  – Derivation & geometric properties
  – Applications
• LDA
  – Motivation: why add a hyperparameter?
  – Dirichlet distribution
  – Variational EM
  – Relations with other topic models
  – Incorporating category information
• Summary

Not Included

• General graphical model theory
• Markov random fields (belief propagation)
• Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning?

Data
• Let x = (x1, x2, …, xD)^T denote a data point, and D = {x(1), x(2), …, x(N)} a data set. D is sometimes associated with desired outputs y1, y2, ….

Predictions
• We are generally interested in predicting something based on the observed data set.
• Given D, what can we say about x(N+1)?

Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ.
• Given data D, we learn the model parameters, from which we can predict new data points.
• The model can often be expressed as a probability distribution over data points.

Likelihood Function

• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data.
• Inversely, given the observed data and a model of interest, the likelihood function is defined as L(θ) = fθ(x) = p(x|θ).
• That is, the likelihood function L(θ) shows that some parameter values are more likely to have produced the data.

Maximum Likelihood (ML)

• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model.
• Suppose we are given n data samples (x1, x2, …, xn); then L(θ) = p(x1, …, xn | θ).
• Maximum likelihood finds θ* = argmaxθ L(θ).
• Predictive distribution: p(x|θ*).

IID – Independent, Identically Distributed

• IID means p(x1, …, xn | θ) = Πi p(xi|θ).
• The problem is considerably simplified: L(θ) = Πi p(xi|θ).
• Usually the log-likelihood is used: ℓ(θ) = log L(θ) = Σi log p(xi|θ).
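A tiny worked example, assuming a Gaussian model for illustration: the i.i.d. log-likelihood sum is maximized by the sample mean and standard deviation:

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)

def log_likelihood(x, mu, sigma):
    # sum_i log p(x_i | mu, sigma) for a Gaussian model
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

mu_ml, sigma_ml = x.mean(), x.std()   # the ML estimates
print(mu_ml, sigma_ml)
print(log_likelihood(x, mu_ml, sigma_ml) >= log_likelihood(x, 0.0, 1.0))
```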

Reference

• Zoubin Ghahramani. Machine Learning (4F13), Cambridge, 2006. (Introduction to Machine Learning, Lectures 1–2 slides.)
• Gregor Heinrich. Parameter estimation for text analysis. Technical note, 2005–2008.

EXPECTATION MAXIMIZATION

Why We Need EM

• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models.
• Why do we need latent variables?
  – To describe complex models, e.g., the Gaussian mixture model.
  – To discover the intrinsic structure inside a data set, e.g., topic models such as pLSA and LDA.

More General

• Data set: D = {y1, …, yN}, with latent variables X. Likelihood: p(D|θ) = ∫ p(X, D|θ) dX.
• Goal: learn maximum likelihood (ML) parameter values.
• The maximum likelihood procedure finds parameters θ such that θ* = argmaxθ p(D|θ).
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated and hard-to-optimize function.

The Expectation Maximization (EM) Algorithm

• The EM algorithm finds a (local) maximum of a latent variable model's likelihood. It starts from arbitrary values of the parameters and iterates two steps:
• E step: fill in the values of the latent variables according to their posterior given the data.
• M step: maximize the likelihood as if the latent variables were not hidden.
• It decomposes a difficult problem into a series of tractable steps.

Jensen's Inequality

• For a concave function such as log: log E[x] ≥ E[log x]; in particular, log Σi qi xi ≥ Σi qi log xi for any distribution q.

Lower Bounding the Log Likelihood

• Observed data: D = {yn}; latent variables: X = {xn}; parameters: θ.
• Goal: maximize the log likelihood (i.e., ML learning) w.r.t. θ: L(θ) = log p(D|θ).
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
  L(θ) = log ∫ p(X, D|θ) dX ≥ ∫ q(X) log p(X, D|θ) dX + H[q] =: F(q, θ),
• where H[q] is the entropy of q(X).

The E and M Steps of EM

• The lower bound on the log likelihood is F(q, θ) = ∫ q(X) log p(X, D|θ) dX + H[q].
• EM alternates between:
• E step: optimize F w.r.t. the distribution over the hidden variables, holding the parameters fixed: q(k)(X) = argmax_q F(q, θ(k−1)).
• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed: θ(k) = argmax_θ F(q(k), θ).

The E Step

• E step for fixed θ: F(q, θ) = L(θ) − KL(q(X) ‖ p(X|D, θ)).
• The second term is the Kullback–Leibler divergence.
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when KL(q(X) ‖ p(X|D, θ)) = 0.
• So the E step simply sets q(k)(X) = p(X|D, θ(k−1)).

The M Step

• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed:
  θ(k) = argmax_θ ∫ q(k)(X) log p(X, D|θ) dX.
• This holds because the entropy of q(X) does not depend directly on θ.
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically.
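To see both steps in one place, here is a compact EM sketch for a two-component, one-dimensional Gaussian mixture; holding the variances fixed at one is a simplifying assumption for brevity, not part of the general algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

mu = np.array([-1.0, 1.0])     # initial means
pi = np.array([0.5, 0.5])      # initial mixing weights

for _ in range(50):
    # E step: posterior responsibility of each component for each point
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate parameters as if responsibilities were counts
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    pi = Nk / len(x)

print(mu, pi)   # should approach means (-2, 3) and weights (0.5, 0.5)
```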

EM Never Decreases the Likelihood

• The E and M steps together never decrease the log likelihood:
  L(θ(k−1)) = F(q(k), θ(k−1)) ≤ F(q(k), θ(k)) ≤ L(θ(k)).
• The E step brings F(q, θ) up to the likelihood L(θ).
• The M step then pushes F(q, θ) higher by maximizing over θ.
• F(q, θ) ≤ L(θ) by Jensen's inequality — or, equivalently, from the non-negativity of the KL divergence.

Reference

• Zoubin Ghahramani. Machine Learning (4F13), Cambridge, 2006. (Unsupervised learning, Lecture 5 slides.)
• Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

WHY DO WE NEED GRAPHICAL MODELS?

Why Do We Need Graphical Models?

• Cons:
  – Graphical models become complex quickly, even with a few cycles.
  – We have to make many assumptions.
• Pros:
  – We do need probability to explain our world, but joint probabilities are hard to compute directly.
  – Graphical models can help us analyze and understand our problems.
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables.
  – With a graphical model, we can decouple a joint probability into conditional probabilities, which are usually easier to handle.

Directed Acyclic Graphical Models (Bayesian Networks)

• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution:
  p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D).
• In general: p(X1, …, XN) = Πi p(Xi | Xpa(i)),
• where pa(i) are the parents of node i.
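A small sketch of this factorization, with hypothetical binary-variable conditional probability tables (all numbers are made up); the point is only that the factorized joint is a valid distribution:

```python
# p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
pA = {0: 0.6, 1: 0.4}
pB = {0: 0.7, 1: 0.3}
pC_AB = {(a, b): {0: 0.5, 1: 0.5} for a in (0, 1) for b in (0, 1)}
pD_BC = {(b, c): {0: 0.8, 1: 0.2} for b in (0, 1) for c in (0, 1)}
pE_CD = {(c, d): {0: 0.1, 1: 0.9} for c in (0, 1) for d in (0, 1)}

def joint(a, b, c, d, e):
    return (pA[a] * pB[b] * pC_AB[(a, b)][c]
            * pD_BC[(b, c)][d] * pE_CD[(c, d)][e])

# Sanity check: the factorized joint sums to one over all assignments.
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(total)  # 1.0
```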

Directed Graphs for Statistical Models: Plate Notation

• Example: a data set of N points generated from a Gaussian. The plate (a box around the observed node) replicates it N times, with the Gaussian's parameters shared outside the plate.

pLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

• For natural language queries, simple term matching does not work effectively:
  – terms are ambiguous;
  – the same queries vary due to personal styles.
• Latent semantic indexing creates a "latent semantic space" (hidden meaning).
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms.
• Disadvantage: the statistical foundation is missing.

pLSA – Probabilistic Latent Semantic Analysis

• Automated document indexing and information retrieval.
• Identification of latent classes using an Expectation-Maximization (EM) algorithm.
• Shown to handle:
  – polysemy:
    • "Java" could mean "coffee" and also the programming language Java;
    • "cricket" is a game and also an insect;
  – synonymy:
    • "computer", "PC", and "desktop" could all mean the same thing.
• Has a better statistical foundation than LSA.

pLSA

[Figure: plate notation — for each of M documents d, each word position has a latent topic z generating the observed word w; unrolled, a document d has topic variables z1, …, zN emitting words w1, …, wN.]

z1, …, zN are variables, with zi ∈ [1, K]; K is the number of latent topics.

pLSA

[Figure: the model unrolled over documents d1, d2, …, dM, each with its own topic variables and words.]

The topic-conditional word distributions p(w|z=1), …, p(w|z=K) are shared by all documents.

Likelihood

Joint Probability vs. Likelihood

• Joint probability: p(d, w, z) = p(d) p(z|d) p(w|z).
• Likelihood (only over the observed variables): L = Πd Πw p(d, w)^n(d,w), with p(d, w) = p(d) Σz p(z|d) p(w|z).
• p(d) is assumed to be uniform.

Document Decomposition

• Each document can be decomposed as p(w|d) = Σz p(w|z) p(z|d).
• This is similar to matrix decomposition, if we consider each discrete distribution as a vector: p(w|d) = Z_{V×K} p(z|d), where the columns of Z are the topic distributions p(w|z).
• With many documents, we hope to find the latent topics as a common basis.

pLSA – Objective Function

• pLSA tries to maximize the log likelihood ℓ = Σd Σw n(d, w) log p(d, w).
• Due to the summation over z inside the log, we have to resort to EM.

EM Steps

• E-step:
  – the expectation of the complete-data likelihood is calculated with the current parameter values.
• M-step:
  – update the parameters with the calculated posterior probabilities;
  – find the parameters that maximize the likelihood function.

Lower Bounding the Log Likelihood

EM Steps

• The E-step: p(z|d, w) ∝ p(z|d) p(w|z), normalized over z.
• The M-step: p(w|z) ∝ Σd n(d, w) p(z|d, w) and p(z|d) ∝ Σw n(d, w) p(z|d, w).
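A bare-bones sketch of these two updates on a toy term-document count matrix (sizes, initialization, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.integers(0, 5, size=(6, 12)).astype(float)   # 6 docs x 12 words
D, W = n.shape
K = 3

p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)  # p(z|d)
p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)  # p(w|z)

for _ in range(100):
    # E-step: posterior p(z|d,w) for every (d, w) pair
    post = p_z_d[:, :, None] * p_w_z[None, :, :]       # shape (D, K, W)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate p(w|z) and p(z|d) from expected counts
    nz = n[:, None, :] * post                          # expected counts
    p_w_z = nz.sum(axis=0); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = nz.sum(axis=2); p_z_d /= p_z_d.sum(1, keepdims=True)

print(p_z_d.round(2))
```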

Latent Subspace

pLSA vs. LSA

• LSA and pLSA both perform dimensionality reduction:
  – in LSA, by keeping only K singular values;
  – in pLSA, by having K aspects.
• Comparison to SVD:
  – the U matrix is related to P(z|d) (document to aspect);
  – the V matrix is related to P(w|z) (aspect to term);
  – the Σ matrix is related to P(z) (aspect strength).

pLSA vs. LSA

• The main difference is the way the approximation is done.
• pLSA generates a model (the aspect model) and maximizes its predictive power.
• Selecting the proper value of K is heuristic in LSA.
• Model selection in statistics can determine the optimal K in pLSA.

Applications

• Text mining: topic discovery.
• Scene classification.

Text Mining

Scene Classification

Classification Result

Reference

• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. In Proc. of the European Conference on Computer Vision (2006).
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportion.
• The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA

• There is no constraint on the per-document distributions p(z|di).
• This easily leads to serious over-fitting problems.

[Figure: each document di carries its own free topic proportions p(z|d1), p(z|d2), …, p(z|dm).]

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.
• Requirements for such a distribution:
  – the samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one, i.e., the samples are multinomial parameter vectors;
  – it is easy to optimize.
• The Dirichlet distribution is one such distribution.
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution

• Definition:
  p(x1, …, xK | α1, …, αK) = [Γ(Σi αi) / Πi Γ(αi)] Πi xi^(αi − 1),
  subject to xi ≥ 0 and Σi xi = 1.
• The density is zero outside this open (K − 1)-dimensional simplex.
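A quick numerical look at the Dirichlet as a distribution over topic proportions; the first two parameter choices echo the example slides below, and every sample lies on the simplex:

```python
import numpy as np

rng = np.random.default_rng(0)

for alpha in ([6, 2, 2], [3, 7, 5], [0.5, 0.5, 0.5]):
    theta = rng.dirichlet(alpha, size=5)
    print(alpha, theta.round(2), theta.sum(axis=1))  # rows sum to 1

# Small alpha (< 1) pushes samples toward the simplex corners (sparse
# mixtures); large alpha concentrates them near the mean alpha / sum(alpha).
```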

Example Dirichlet Distributions (K=3)

• Various parameter settings α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6).

Example Dirichlet Distributions (K=3)

• Equal αi, different α0 = Σi αi: e.g., α0 = 0.1, α0 = 1, α0 = 10.

The LDA Model

[Figure: graphical model — α → θ → zn → wn ← β, replicated over documents and word positions.]

• For each document:
  – choose θ ~ Dirichlet(α);
  – for each of the N words wn:
    » choose a topic zn ~ Multinomial(θ);
    » choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn.

The LDA Model

• For each document:
  – choose θ ~ Dirichlet(α);
  – for each of the N words wn:
    » choose a topic zn ~ Multinomial(θ);
    » choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn.
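A generative sketch of this process (all sizes and hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N_words = 3, 10, 8          # topics, vocabulary size, words per doc
alpha = np.ones(K) * 0.5          # Dirichlet hyperparameter
beta = rng.dirichlet(np.ones(V), size=K)   # K rows: p(w | z=k)

def generate_document():
    theta = rng.dirichlet(alpha)              # theta ~ Dirichlet(alpha)
    words = []
    for _ in range(N_words):
        z = rng.choice(K, p=theta)            # z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])          # w_n ~ p(w | z_n, beta)
        words.append(w)
    return theta, words

theta, doc = generate_document()
print(theta.round(2), doc)
```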

Joint Probability

• Given the parameters α and β:
  p(θ, z, w | α, β) = p(θ|α) Πn p(zn|θ) p(wn|zn, β),
• where p(zn|θ) is simply θi for the unique i such that zn = i.

Likelihood

• Joint probability: p(θ, z, w | α, β), as above.
• Marginal distribution of a document: p(w|α, β) = ∫ p(θ|α) Πn Σzn p(zn|θ) p(wn|zn, β) dθ.
• Likelihood over all the documents: L = Πd p(wd | α, β).

Inference

• The likelihood can be computed by summing over each document.
• Jensen's inequality is applied, as in EM.

Inference

• In the E-step, we need to compute the posterior distribution of the hidden variables, p(θ, z | w, α, β).
• Unfortunately, this distribution is intractable to compute in general.
• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters (γ, φ) and minimize the KL divergence between the variational and posterior distributions.

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence:
  log p(w|α, β) = L(γ, φ; α, β) + KL(q(θ, z|γ, φ) ‖ p(θ, z|w, α, β)).
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs. EM

• They differ only in the E-step.
• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, we approximate p(X|D, θ) by a variational distribution q(X), minimizing KL(q(X) ‖ p(X|D, θ)).
• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.
• Strategy (variational EM):
  – lower-bound log p(w|α, β) by a function L(γ, φ; α, β);
  – repeat until convergence:
    » E: maximize L with respect to the variational parameters γ and φ;
    » M: maximize the bound with respect to the model parameters α and β.

Parameter Estimation

• E-step (variational inference — repeat until convergence):
  φni ∝ βi,wn exp(Ψ(γi)),  γi = αi + Σn φni.
• M-step (parameter estimation):
  βij ∝ Σd Σn φdni w(j)dn;
  α can be updated using the Newton–Raphson method.

Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus.

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset: contains 8,000 documents and 15,818 words.

(a) EARN vs. NOT EARN; (b) GRAIN vs. NOT GRAIN.

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

[Figure: the LDA graphical model unrolled over several documents, all sharing the same Dirichlet prior.]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information.

[Figure: plate notation with nodes π, z, x and parameters θ, β, replicated over M documents and Nd patches.]

Codebook

• 174 local image patches.
• Detection: evenly sampled grid, random sampling, saliency detector, or Lowe's DoG detector.
• Representation: normalized 11×11 gray values, or 128-dim SIFT.

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006.
• Correlated topic model, NIPS 2005.
• Hierarchical Dirichlet process, Journal of the American Statistical Association, 2003.
• Nonparametric Bayes pachinko allocation, UAI 2007.
• Supervised LDA, NIPS 2007.
• MedLDA – maximum margin discriminant LDA, ICML 2009.
• …

Are You Really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, December 2005.

Reference

• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – we need to access 1000 inverted lists;
  – the intersection of 1000 inverted lists may be empty;
  – the union of 1000 inverted lists may be the whole corpus.
• Dimension reduction (topic projection):

  Term1  Term2  Term3  Term4  …  TermN      (dim = 1 million)
  Img1:  1      2      0      0  …  2

  ↓ topic projection

  f1   f2   …  fM                           (dim = 200)
  Img1: 0.2  0.1  …  0.03

Key Idea: Dimension Reduction + Residual Error Preservation

p ≈ Xw + ε

• p: the original TF-IDF vector in vocabulary space.
• X: the projection matrix for dimension reduction.
• w: the low-dimensional feature vector.
• ε: the residual error.

An image = a compact low-dimensional representation + a few residual words (~10 words).
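A sketch of this idea under stated assumptions (orthonormal random basis, sparse toy vector, illustrative sizes): project, keep the low-dimensional part, and preserve only the few largest residual coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 20                               # vocabulary size, reduced dim
X, _ = np.linalg.qr(rng.standard_normal((V, k)))  # orthonormal basis (toy)

p = np.abs(rng.standard_normal(V)) * (rng.random(V) < 0.02)  # sparse vector
w = X.T @ p                          # low-dimensional representation
eps = p - X @ w                      # residual error
top = np.argsort(-np.abs(eps))[:10]  # keep the 10 strongest residual words

p_approx = X @ w
p_approx[top] += eps[top]            # "a few words" correct the projection
print(np.linalg.norm(p - X @ w), np.linalg.norm(p - p_approx))
```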

Orthogonal Decomposition

p = w1x1 + w2x2 + … + wkxk + ε

• x1, x2, …, xk: the base vectors (the columns of X).
• (w1, …, wk): the low-dimensional representation.
• ε: the residual.
• Because the basis is orthogonal, the residual is orthogonal to the subspace, so the similarity between two images decomposes as p^T q = wp^T wq + εp^T εq: a low-dimensional part plus a residual part.

An image = a compact low-dimensional representation + a few residual words (~10 words).

A Probabilistic Implementation

• x is a switch variable; it controls whether a word is generated from a topic-specific distribution, a document-specific distribution, or a background distribution:
  p(w|d) = p(x=0|d) Σk=1..K p(w|z=k) p(z=k|d) + p(x=1|d) p'(w|d) + p(x=2|d) p(w).

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.

Search (Online)

[Figure: online search pipeline — a query image is converted to its compact representation plus a few residual words; an LSH index over the low-dimensional vectors returns a candidate list (e.g., Doc 300, Doc 401, …), which is then re-ranked using the residual words and document metadata.]

• Index: 10M images, 46GB. Search speed: < 100ms.

Search Example

[Figure: a query image and its retrieved results.]

Search Example

[Figure: another query image and its retrieved results.]

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-jing Wang, Wei Wang, Lei Zhang
SIGIR 2009

Semantics & Structure

• Semantics: the topics discussed in a thread.
• Structure: who replies to whom.
• The idea is to model semantics and structure jointly, optimizing them together.

Reply Reconstruction

• Candidate reply links are scored by combining document similarity, topic similarity, and structure similarity.

Baselines:
• NP: reply to the nearest post.
• RR: reply to the root.
• DS: document similarity.
• LDA: Latent Dirichlet Allocation; project documents onto the topic space.
• SWB: Special Words Topic Model with Background distribution; project documents onto the topic and junk-topic spaces.

Evaluation

Method   Slashdot (All Posts)  Slashdot (Good Posts)  Apple (All Posts)  Apple (Good Posts)
NP       0.021                 0.012                  0.289              0.239
RR       0.183                 0.319                  0.269              0.474
DS       0.463                 0.643                  0.409              0.628
LDA      0.465                 0.644                  0.410              0.648
SWB      0.463                 0.644                  0.410              0.641
SMSS     0.524                 0.737                  0.517              0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding.
• Methods: HITS, PageRank, …

Baselines

• LM: "Formal Models for Expert Finding in Enterprise Corpora", SIGIR '06. Achieves stable performance on the expert finding task using a language model.
• PageRank: benchmark nodal ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: "Personalized Recommendation Driven by Information Flow", SIGIR '06. Finds the most influential nodes.

Evaluation

• Bayesian estimate.

Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision:
  – matrix decomposition is a good way to practice working with matrices;
  – graphical models are a good way to practice probability.
• A graphical model is a good tool for analyzing problems.
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images.
• The graphical-model approach is more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 11: An Introduction To Matrix Decomposition and Graphical Model

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

Method | Slashdot, All Posts | Slashdot, Good Posts | Apple, All Posts | Apple, Good Posts
NP     | 0.021 | 0.012 | 0.289 | 0.239
RR     | 0.183 | 0.319 | 0.269 | 0.474
DS     | 0.463 | 0.643 | 0.409 | 0.628
LDA    | 0.465 | 0.644 | 0.410 | 0.648
SWB    | 0.463 | 0.644 | 0.410 | 0.641
SMSS   | 0.524 | 0.737 | 0.517 | 0.772


Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding
• Ranking methods: HITS, PageRank, … (see the sketch below)
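Once replies are reconstructed, who-replies-to-whom gives a user graph, and expert finding reduces to node ranking on it. A minimal PageRank power iteration over such a graph (damping 0.85 is the conventional choice, not taken from the paper):

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    """adj[i, j] = number of replies user j wrote to user i (edge j -> i)."""
    n = adj.shape[0]
    col = adj.sum(axis=0).astype(float)
    col[col == 0] = 1.0                 # dangling users keep teleport mass only
    M = adj / col                       # column-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = d * (M @ r) + (1 - d) / n   # power iteration with random jump
    return r                            # higher score = more authoritative user

# usage: scores = pagerank(reply_counts); experts = np.argsort(-scores)
```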


Baselines:
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance on the expert-finding task using a language model.
• PageRank: benchmark nodal ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential nodes.

Evaluation

• Bayesian estimate

Method        | MRR   | MAP   | P@10
LM            | 0.821 | 0.698 | 0.800
EABIF(ori)    | 0.674 | 0.362 | 0.243
EABIF(rec)    | 0.742 | 0.318 | 0.281
PageRank(ori) | 0.675 | 0.377 | 0.263
PageRank(rec) | 0.743 | 0.321 | 0.266
HITS(ori)     | 0.906 | 0.832 | 0.900
HITS(rec)     | 0.938 | 0.822 | 0.906
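For reference, the ranking metrics in this table can be computed as follows (a generic sketch, not the paper's evaluation code; P@10 is described in the closing comment):

```python
def mrr(rankings, relevant):
    """Mean reciprocal rank. rankings[q] is a ranked id list for query q;
    relevant[q] is the set of correct ids for that query."""
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        rank = next((i for i, x in enumerate(ranking, 1) if x in rel), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(rankings)

def average_precision(ranking, rel):
    hits, score = 0, 0.0
    for i, x in enumerate(ranking, 1):
        if x in rel:
            hits += 1
            score += hits / i
    return score / max(len(rel), 1)

# MAP = mean of average_precision over all queries; P@10 = the fraction of
# relevant items in the top 10 of each ranking, averaged over queries.
```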

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: a good practice for learning matrices
  – Graphical models: a good practice for learning probability

• Graphical models are a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 13: An Introduction To Matrix Decomposition and Graphical Model

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

• Dimension reduction:

        Term1  Term2  Term3  Term4  …  TermN
Img1    1      2      0      0      …  2          (Dim = 1 million)

            ↓ Topic Projection

        f1     f2     …      fM
Img1    0.2    0.1    …      0.03                 (Dim = 200)

Key Idea: Dimension Reduction + Residual Error Preservation

• p: the original TF-IDF vector in vocabulary space
• X: the projection matrix for dimension reduction
• w: the low-dimensional feature vector
• ε: the residual error

p = Xw + ε
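A small numpy sketch of this decomposition, with a random orthonormal X standing in for the learned projection matrix (all names and sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    V, k = 10_000, 200                               # vocabulary size, reduced dim
    X, _ = np.linalg.qr(rng.standard_normal((V, k))) # orthonormal stand-in for the
                                                     # learned projection matrix
    p = rng.random(V) * (rng.random(V) < 0.01)       # a sparse TF-IDF-like vector

    w = X.T @ p                                      # low-dimensional feature vector
    eps = p - X @ w                                  # residual error

    # Preserve only the largest residual entries -- the "+ a few words" part.
    top10 = np.argsort(-np.abs(eps))[:10]
    sparse_eps = np.zeros_like(eps)
    sparse_eps[top10] = eps[top10]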

An image = [low-dimensional feature vector (bar chart)] + a few words (10 words)

Orthogonal Decomposition

p = Xw + ε,  where X = [x1, x2, …, xk] has orthonormal columns (XᵀX = I)

w = Xᵀp        — low-dimensional representation
ε = p − Xw     — residual, orthogonal to the base vectors x1 … xk

An image = [low-dimensional feature vector (bar chart)] + a few words (10 words)

Since XᵀX = I and Xᵀε = 0, inner products are preserved exactly:

pᵀq = (Xw_p + ε_p)ᵀ(Xw_q + ε_q) = w_pᵀ w_q + ε_pᵀ ε_q

X = [X1, X2, X3, …, Xk]
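A quick numeric check of this identity, again with a random orthonormal basis in place of the learned one:

    import numpy as np

    rng = np.random.default_rng(1)
    V, k = 1_000, 50
    X, _ = np.linalg.qr(rng.standard_normal((V, k)))   # X.T @ X = I

    def decompose(p):
        w = X.T @ p
        return w, p - X @ w                            # X.T @ (p - X @ w) = 0

    p, q = rng.random(V), rng.random(V)
    wp, ep = decompose(p)
    wq, eq = decompose(q)

    # Cross terms vanish, so the original inner product splits exactly.
    assert np.isclose(p @ q, wp @ wq + ep @ eq)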

A Probabilistic Implementation

x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution
• a document-specific distribution
• a background distribution

p(w|d) = p(x=0|d) · Σ_{k=1..K} p(w|z=k) p(z=k|d)
       + p(x=1|d) · p_special(w|d)       (document-specific distribution)
       + p(x=2|d) · p_background(w)      (background distribution)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
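A hedged sketch of the generative process summarized above; the parameter names (lam_d, theta_d, psi_d, omega) are mine, not from the paper:

    import numpy as np

    rng = np.random.default_rng(2)

    def sample_word(lam_d, theta_d, beta, psi_d, omega):
        """Draw one word of document d.

        lam_d   : (3,) p(x|d) over {topic, document-specific, background}
        theta_d : (K,) topic mixture p(z|d)
        beta    : (K, V) topic-word distributions p(w|z)
        psi_d   : (V,) document-specific word distribution
        omega   : (V,) background word distribution
        """
        x = rng.choice(3, p=lam_d)                     # the switch variable
        if x == 0:                                     # topic-specific route
            z = rng.choice(len(theta_d), p=theta_d)
            return rng.choice(beta.shape[1], p=beta[z])
        if x == 1:                                     # document-specific route
            return rng.choice(len(psi_d), p=psi_d)
        return rng.choice(len(omega), p=omega)         # background route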

Search (Online)

[Figure: online search pipeline — the query's low-dimensional vector is hashed into an LSH index over per-bucket document signatures (DS1, DS2, …); the matched bucket yields a candidate list (e.g., Doc 300, Doc 401, …), which is re-ranked with the stored doc meta data to produce the final ranking]

A query = [low-dimensional feature vector (bar chart)] + a few words

Index: 10M images, 46GB. Search speed: < 100ms.
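The pipeline can be sketched as follows; a single-table random-hyperplane LSH stands in for the actual index, and the scoring reuses the w·w + ε·ε similarity from the orthogonal decomposition (all sizes and names are illustrative):

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(3)
    k, n_bits = 200, 16
    planes = rng.standard_normal((n_bits, k))          # random-hyperplane LSH

    def signature(w):
        return ((planes @ w) > 0).tobytes()            # bucket key for vector w

    # Offline: bucket every document's low-dimensional vector.
    docs_w = rng.standard_normal((10_000, k))          # stand-in for 10M image docs
    docs_eps = [{} for _ in range(len(docs_w))]        # sparse residual words per doc
    index = defaultdict(list)
    for doc_id, w in enumerate(docs_w):
        index[signature(w)].append(doc_id)

    # Online: probe the query's bucket, then re-rank candidates exactly
    # using the preserved similarity  w_p . w_q + eps_p . eps_q.
    def search(qw, q_eps, topn=10):
        candidates = index[signature(qw)]
        def score(doc_id):
            resid = sum(v * docs_eps[doc_id].get(t, 0.0) for t, v in q_eps.items())
            return docs_w[doc_id] @ qw + resid
        return sorted(candidates, key=score, reverse=True)[:topn]

    results = search(docs_w[0], q_eps={})              # a query finds its own bucket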

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009


Semantic & structure

Semantic: topics
Structure: who replies to whom

Optimize them together:
– model the semantics
– model the structure


Reply reconstruction

– Document similarity
– Topic similarity
– Structure similarity
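One simplified way to reconstruct replies from these three signals — the actual SMSS model optimizes a sparse-coding objective, but a weighted combination of the similarities conveys the idea; the weights and field names here are illustrative:

    import numpy as np

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def pick_parent(post, earlier, w_doc=1.0, w_topic=1.0, w_struct=1.0):
        """Pick which earlier post a new post replies to. Each post is a dict
        with 'terms' (word vector), 'topics' (topic vector), and 'struct'
        (structural features, e.g. position and author-interaction cues)."""
        def score(cand):
            return (w_doc * cos(post["terms"], cand["terms"])
                    + w_topic * cos(post["topics"], cand["topics"])
                    + w_struct * cos(post["struct"], cand["struct"]))
        return max(earlier, key=score)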

Baselines:
– NP: Reply to Nearest Post
– RR: Reply to Root
– DS: Document Similarity
– LDA: Latent Dirichlet Allocation; project documents to topic space
– SWB: Special Words Topic Model with Background distribution; project documents to topic and junk-topic space

Evaluation

method    Slashdot                Apple
          All Posts  Good Posts   All Posts  Good Posts
NP        0.021      0.012        0.289      0.239
RR        0.183      0.319        0.269      0.474
DS        0.463      0.643        0.409      0.628
LDA       0.465      0.644        0.410      0.648
SWB       0.463      0.644        0.410      0.641
SMSS      0.524      0.737        0.517      0.772

Expert finding

Pipeline: reply reconstruction → network construction → expert finding

Methods: HITS, PageRank, …
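A sketch of HITS by power iteration on the reconstructed reply network; the edge orientation (a reply from user i to user j gives j authority) is an assumption for illustration:

    import numpy as np

    def hits(L, n_iter=100):
        """Power iteration for HITS. L[i, j] = 1 when user i replies to user j
        (so answerers accumulate authority). Returns (authority, hub) scores."""
        a = np.ones(L.shape[0])
        h = np.ones(L.shape[0])
        for _ in range(n_iter):
            a = L.T @ h                                # authorities gather from hubs
            h = L @ a                                  # hubs point to authorities
            a /= np.linalg.norm(a)
            h /= np.linalg.norm(h)
        return a, h

    # Toy reply network over 4 users, built from reconstructed reply edges.
    L = np.array([[0., 1., 1., 0.],
                  [0., 0., 1., 0.],
                  [0., 0., 0., 1.],
                  [0., 0., 0., 0.]])
    authority, hub = hits(L)
    experts = np.argsort(-authority)                   # rank users as experts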

Baselines:
– LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model.
– PageRank: benchmark nodal ranking method.
– HITS: finds hub nodes and authority nodes.
– EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential nodes.

Evaluation


• Bayesian estimate

Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
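For reference, minimal helpers for the metrics reported above (MRR and P@10), assuming per-query ranked candidate lists and relevance sets:

    def mrr(rankings, relevant):
        """Mean reciprocal rank: rankings[q] is a ranked candidate list,
        relevant[q] the set of correct answers for query q."""
        return sum(next((1.0 / (i + 1) for i, d in enumerate(r) if d in rel), 0.0)
                   for r, rel in zip(rankings, relevant)) / len(rankings)

    def precision_at(rankings, relevant, k=10):
        """P@k averaged over queries."""
        return sum(len(set(r[:k]) & rel) / k
                   for r, rel in zip(rankings, relevant)) / len(rankings)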

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision.
– Matrix decomposition – a good practice to learn matrices
– Graphical model – a good practice to learn probability

• Graphical model is a good tool to analyze problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

Page 15: An Introduction To Matrix Decomposition and Graphical Model

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA

• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus

• Dimension reduction:

          Term1  Term2  Term3  Term4  …  TermN      (Dim = 1 million)
  Img1      1      2      0      0    …    2

                     ↓ Topic Projection

           f1     f2    …    fM                     (Dim = 200)
  Img1    0.2    0.1    …   0.03

Key Idea Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

  p ≈ Xw,  with the residual ε = p − Xw preserved alongside w

[Figure: an image = a low-dimensional topic vector (bar chart) + a few residual words (~10 words)]

Orthogonal Decomposition

With an orthogonal basis X = (x1, x2, …, xk), p decomposes as

  p = Xw + ε,  where w = X^T p and ε = p − X X^T p

• X: base vectors
• w: low-dimensional representation
• ε: residual

[Figure: an image = a low-dimensional topic vector over the basis X1, X2, X3, …, Xk + a few residual words (~10 words)]

Since X is orthogonal and the residual is orthogonal to span(X), inner products are preserved exactly:

  p^T q = (X w_p + ε_p)^T (X w_q + ε_q) = w_p^T w_q + ε_p^T ε_q
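A minimal NumPy sketch of this decomposition (the sizes and the random orthonormal basis are illustrative assumptions; in the paper X would come from a learned topic projection):

```python
import numpy as np

rng = np.random.default_rng(1)
V, k = 1000, 20                        # toy sizes: vocabulary dim, reduced dim

# Orthonormal basis X (V x k); here a random one, via QR
X, _ = np.linalg.qr(rng.standard_normal((V, k)))

def decompose(p):
    w = X.T @ p                        # low-dimensional representation
    eps = p - X @ w                    # residual, orthogonal to span(X)
    return w, eps

p, q = rng.standard_normal(V), rng.standard_normal(V)
wp, ep = decompose(p)
wq, eq = decompose(q)

# Inner products are preserved: p.q == wp.wq + ep.eq
assert np.isclose(p @ q, wp @ wq + ep @ eq)
```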

A Probabilistic Implementation

x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution
• a document-specific distribution
• a background distribution

  p(w|d) = p(x=0|d) · Σ_{k=1}^{K} p(w|z=k) p(z=k|d)
         + p(x=1|d) · p′(w|d)
         + p(x=2|d) · p(w|Ω)

where p′(w|d) is the document-specific distribution and p(w|Ω) is the background distribution.

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
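A hedged sketch of this switch-variable generative step (all the distributions below are random toy stand-ins, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(3)
K, V = 10, 500                                  # toy values: topics, vocabulary size

# Hypothetical distributions for one document d
switch = np.array([0.6, 0.3, 0.1])              # p(x|d): topic / doc-specific / background
topic_mix = rng.dirichlet(np.ones(K))           # p(z|d)
topic_word = rng.dirichlet(np.ones(V), size=K)  # p(w|z=k)
doc_word = rng.dirichlet(np.ones(V))            # document-specific p'(w|d)
bg_word = rng.dirichlet(np.ones(V))             # background p(w|Omega)

def sample_word():
    x = rng.choice(3, p=switch)                 # draw the switch variable
    if x == 0:
        z = rng.choice(K, p=topic_mix)          # topic-specific route
        return rng.choice(V, p=topic_word[z])
    elif x == 1:
        return rng.choice(V, p=doc_word)        # document-specific route
    return rng.choice(V, p=bg_word)             # background route
```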

Search (Online)

[Figure: online search — the query's low-dimensional vector probes an LSH index; each bucket stores document signatures (DS1, DS2, …) and returns a candidate list (e.g., Doc 300, Doc 401, …)]
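To illustrate the indexing step, a SimHash-style LSH sketch over the low-dimensional vectors w (the hash family, sizes, and single-table design are assumptions for illustration; the slide does not specify the exact LSH configuration, and practical systems use multiple hash tables):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)
k, n_bits = 20, 16                     # reduced dim, signature length (toy values)
H = rng.standard_normal((n_bits, k))   # random hyperplanes

def lsh_signature(w):
    # Sign pattern of random projections; nearby vectors tend to collide
    return tuple((H @ w > 0).astype(int))

# Offline: bucket documents by signature
index = defaultdict(list)
docs = rng.standard_normal((1000, k))
for i, w in enumerate(docs):
    index[lsh_signature(w)].append(i)

# Online: probe the query's bucket, then re-rank candidates with the residuals
query = rng.standard_normal(k)
candidates = index[lsh_signature(query)]
```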

[Figure: a query = a low-dimensional topic vector + a few residual words]

[Figure: re-ranking — the LSH candidates (Doc 1 … Doc N) are re-ranked using the residual errors and document metadata to produce the final results (e.g., Doc 300, Doc 401, …)]

Index: 10M images, 4.6 GB. Search speed: < 100 ms.

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009


Semantic & structure


• Semantic: topics
• Structure: who replies to whom
• Optimize them together: model the semantics and the structure jointly


Reply reconstruction


• Document similarity
• Topic similarity
• Structure similarity

Baselines:
• NP: reply to the nearest post
• RR: reply to the root
• DS: document similarity
• LDA: Latent Dirichlet Allocation – project documents to topic space
• SWB: Special Words topic model with Background distribution – project documents to topic and junk-topic space


Evaluation

Method   Slashdot (All)   Slashdot (Good)   Apple (All)   Apple (Good)
NP            0.021             0.012           0.289          0.239
RR            0.183             0.319           0.269          0.474
DS            0.463             0.643           0.409          0.628
LDA           0.465             0.644           0.410          0.648
SWB           0.463             0.644           0.410          0.641
SMSS          0.524             0.737           0.517          0.772


Expert finding

Pipeline: reply reconstruction → network construction → expert finding

Ranking methods: HITS, PageRank, …


Baselines:
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06 – achieves stable performance in the expert-finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06 – finds the most influential nodes

Evaluation

• Bayesian estimate

Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF (ori)      0.674   0.362   0.243
EABIF (rec)      0.742   0.318   0.281
PageRank (ori)   0.675   0.377   0.263
PageRank (rec)   0.743   0.321   0.266
HITS (ori)       0.906   0.832   0.900
HITS (rec)       0.938   0.822   0.906

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition – a good practice for learning matrices
  – Graphical models – a good practice for learning probability

• Graphical models are a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

Page 17: An Introduction To Matrix Decomposition and Graphical Model

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information

[Plate diagram over M images and Nd patches per image: a category-conditioned mixture π generates a topic z, which generates a patch feature x; with parameters θ and β]

Codebook

• 174 local image patches

• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector

• Representation: normalized 11x11 gray values, 128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic Topic Models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association
• Nonparametric Bayes Pachinko Allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.

• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA

• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus

• Dimension reduction:

        Term1  Term2  Term3  Term4  …  TermN
  Img1    1      2      0      0    …    2      (dim = 1 million)

                 ↓  Topic Projection

         f1    f2    …   fM
  Img1   0.2   0.1   …   0.03                   (dim = 200)

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ξ: residual error

  p = Xw + ξ

An image = [compact low-dimensional histogram] + a few words (10 words)

Orthogonal Decomposition

• With an orthonormal basis X = (x1, x2, …, xk), the decomposition p = Xw + ξ becomes

  w = Xᵀp   (low-dimensional representation)
  ξ = p − Xw   (residual, orthogonal to the span of the base vectors x1 … xk)

An image = [compact low-dimensional histogram] + a few words (10 words)

• Inner products are preserved exactly:

  pᵀq = (Xwp + ξp)ᵀ(Xwq + ξq) = wpᵀwq + ξpᵀξq
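A small numerical sketch of this decomposition (illustrative; random data, with an orthonormal X obtained via QR rather than learned from a corpus):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 1000, 20                                  # vocabulary size, reduced dimension

X, _ = np.linalg.qr(rng.standard_normal((n, k))) # orthonormal basis: X.T @ X = I
p = rng.random(n)                                # toy TF-IDF vectors
q = rng.random(n)

def decompose(v):
    w = X.T @ v                                  # low-dimensional representation
    resid = v - X @ w                            # residual, orthogonal to span(X)
    return w, resid

wp, rp = decompose(p)
wq, rq = decompose(q)

# p.q = wp.wq + rp.rq holds exactly for an orthonormal X
assert np.isclose(p @ q, wp @ wq + rp @ rq)
```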

A Probabilistic Implementation

• x is a switch variable: it controls whether a word is generated from
• a topic-specific distribution,
• a document-specific distribution, or
• a background distribution:

  p(w | d) = p(x=0 | d) ∑k=1..K p(w | z=k) p(z=k | d)
           + p(x=1 | d) p′(w | d)
           + p(x=2 | d) p″(w)

(Here p′(w | d) is the document-specific word distribution and p″(w) the corpus-wide background distribution.)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
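As a sketch, the per-word mixture above can be computed directly (toy numbers; the three switch probabilities and the component distributions are assumptions for illustration):

```python
import numpy as np

p_switch = np.array([0.6, 0.3, 0.1])              # p(x=0|d), p(x=1|d), p(x=2|d)
p_w_given_z = np.array([[0.5, 0.4, 0.05, 0.05],   # topic 1
                        [0.05, 0.05, 0.5, 0.4]])  # topic 2
p_z_given_d = np.array([0.7, 0.3])                # topic proportions for document d
p_w_specific = np.array([0.1, 0.1, 0.7, 0.1])     # document-specific distribution
p_w_background = np.full(4, 0.25)                 # background distribution

def p_word_given_doc(w):
    topical = p_w_given_z[:, w] @ p_z_given_d     # sum_k p(w|z=k) p(z=k|d)
    return (p_switch[0] * topical
            + p_switch[1] * p_w_specific[w]
            + p_switch[2] * p_w_background[w])

print([round(p_word_given_doc(w), 4) for w in range(4)])
# the mixture is a proper distribution over the vocabulary
assert np.isclose(sum(p_word_given_doc(w) for w in range(4)), 1.0)
```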

Search (Online)

[System diagram: an LSH index of low-dimensional document signatures (DS1, DS2, …) organized into hash buckets]

A query = [compact low-dimensional histogram] + a few words

• Stage 1: the query's low-dimensional feature is looked up in the LSH index, returning a candidate set (e.g., Doc 300, Doc 401, …) from the full collection (Doc 1 … Doc N).
• Stage 2: the candidates are re-ranked using the preserved residual words and the document metadata.

• Index: 10M images, 4.6 GB. Search speed: < 100 ms.
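A two-stage retrieval of this kind can be sketched as follows (a toy illustration under assumed data structures – random-hyperplane LSH over w plus exact re-ranking with a residual term; not the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
k, n_docs, n_bits = 20, 10000, 16

W = rng.standard_normal((n_docs, k))           # low-dimensional doc features
H = rng.standard_normal((k, n_bits))           # random hyperplanes for LSH

def signature(w):
    return tuple((w @ H > 0).astype(int))      # 16-bit LSH signature

buckets = {}
for i, w in enumerate(W):
    buckets.setdefault(signature(w), []).append(i)

def search(wq, resid_sim, top=10):
    cands = buckets.get(signature(wq), [])     # stage 1: LSH candidate lookup
    # stage 2: re-rank by w_p.w_q plus a residual similarity term
    scored = sorted(cands, key=lambda i: W[i] @ wq + resid_sim(i), reverse=True)
    return scored[:top]

print(search(W[42], resid_sim=lambda i: 0.0))  # doc 42 should rank itself highly
```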

Search Examples

[Two example queries: query image and retrieved results]

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics
• Structure: who replies to whom

• Optimize them together – model the semantics and the structure jointly

Reply Reconstruction

• Combines document similarity, topic similarity, and structure similarity (see the sketch after this list).

Baselines:
• NP: reply to the nearest post
• RR: reply to the root
• DS: document similarity
• LDA: Latent Dirichlet Allocation – project documents to topic space
• SWB: Special Words Topic Model with Background distribution – project documents to topic and junk-topic space
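One simple way to realize "optimize them together" when choosing a parent for each post is a weighted combination of the three similarities (a hypothetical scoring sketch, not the SMSS sparse-coding algorithm from the paper; the weights and vectors are assumptions):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def pick_parent(post_vec, post_topic, candidates, w_doc=0.4, w_topic=0.4, w_struct=0.2):
    """Score earlier posts as reply targets; weights are illustrative assumptions."""
    best, best_score = None, -np.inf
    for cand_id, (vec, topic, struct_prior) in candidates.items():
        score = (w_doc * cosine(post_vec, vec)          # document similarity
                 + w_topic * cosine(post_topic, topic)  # topic similarity
                 + w_struct * struct_prior)             # structural prior, e.g. position
        if score > best_score:
            best, best_score = cand_id, score
    return best

# toy usage (assumed vectors): candidate 1 matches the new post more closely
cands = {1: (np.array([1.0, 0.0]), np.array([0.6, 0.4]), 0.8),
         2: (np.array([0.0, 1.0]), np.array([0.1, 0.9]), 0.5)}
print(pick_parent(np.array([1.0, 0.2]), np.array([0.7, 0.3]), cands))
```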

Evaluation

Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding (HITS, PageRank, …)

Baselines:
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06 – achieves stable performance on the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06 – finds the most influential node

Evaluation

• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision.
– Matrix decomposition – a good practice for learning matrices
– Graphical model – a good practice for learning probability

• A graphical model is a good tool for analyzing problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 18: An Introduction To Matrix Decomposition and Graphical Model

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 19: An Introduction To Matrix Decomposition and Graphical Model

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset – contains 8,000 documents and 15,818 words

(a) EARN vs. NOT EARN   (b) GRAIN vs. NOT GRAIN

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

[Graphical model: the same LDA diagram as above, in which every document's topic proportions are drawn from a single Dirichlet prior]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information

[Plate diagram: for each of M images, the topic proportions π generate a topic z and an observed patch x for each of the Nd patches, with model parameters θ and β]

Codebook

• 174 local image patches

• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector

• Representation: normalized 11×11 gray values, 128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.

• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA

• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus

• Dimension reduction:

        Term1  Term2  Term3  Term4  …  TermN     (Dim = 1 million)
Img1    1      2      0      0      …  2

                  ↓  Topic Projection

        f1     f2     …  fM                      (Dim = 200)
Img1    0.2    0.1    …  0.03

Key Idea Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ξ: residual error

p = Xw + ξ

[Figure: an image = a low-dimensional feature vector (bar chart) + a few words (~10 words)]

Orthogonal Decomposition

p = Xw + ξ = (x1, x2, …, xk)(w1, w2, …, wk)^T + ξ,  with w = X^T p and ξ = p − XX^T p

– x1, …, xk: base vectors
– w: low-dimensional representation
– ξ: residual

[Figure: an image = a low-dimensional feature vector (bar chart) + a few words (~10 words)]

• Because the residuals are orthogonal to the span of the basis (X1, X2, X3, …, Xk), inner products are preserved exactly:

p^T q = wp^T wq + ξp^T ξq
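A small numerical check of this identity (a sketch assuming X has orthonormal columns, e.g. from a QR factorization; not the paper's code):

```python
# Sketch: p = X w + xi with w = X^T p; inner products decompose exactly.
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 20                               # vocabulary size, reduced dim
X, _ = np.linalg.qr(rng.normal(size=(n, k)))  # orthonormal basis, n x k

def decompose(p):
    w = X.T @ p            # low-dimensional representation
    xi = p - X @ w         # residual, orthogonal to span(X)
    return w, xi

p, q = rng.normal(size=n), rng.normal(size=n)
(wp, xp), (wq, xq) = decompose(p), decompose(q)
assert np.isclose(p @ q, wp @ wq + xp @ xq)   # p^T q = w_p^T w_q + xi_p^T xi_q
```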

A Probabilistic Implementation

x is a switch variable: it controls whether a word is generated from

• a topic-specific distribution,

• a document-specific distribution, or

• a background distribution:

p(w | d) = p(x=0 | d) Σ_{k=1}^{K} p(w | z=k) p(z=k | d) + p(x=1 | d) p_doc(w | d) + p(x=2 | d) p_bg(w)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
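A sketch of evaluating this mixture, with all component distributions assumed to be given as arrays (illustrative only, not the authors' code):

```python
# Sketch: switch-variable word likelihood p(w | d).
import numpy as np

def word_likelihood(w, p_x, p_w_given_z, p_z_given_d, p_w_doc, p_w_bg):
    """p_x: (3,) switch probabilities p(x=0|d), p(x=1|d), p(x=2|d)
    p_w_given_z: (K, V) topic-specific; p_z_given_d: (K,) topic mixture
    p_w_doc: (V,) document-specific; p_w_bg: (V,) background."""
    topic_term = p_w_given_z[:, w] @ p_z_given_d   # sum_k p(w|z=k) p(z=k|d)
    return p_x[0] * topic_term + p_x[1] * p_w_doc[w] + p_x[2] * p_w_bg[w]
```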

Search (Online)

[System diagram: each document is indexed as a low-dimensional vector w, hashed into an LSH index, plus a few residual words and document metadata. A query = a low-dimensional vector + a few words; the query vector is hashed to fetch a candidate bucket (e.g., Doc 300, Doc 401, …), and the candidates are re-ranked with the residual words to produce the final list.]

• Index: 10M images, 4.6 GB
• Search speed: < 100 ms
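The online stage can be sketched as follows; the random-hyperplane hash, the 16-bit signature, and the dense stand-in for the residual words are all illustrative assumptions rather than the paper's exact design:

```python
# Sketch: LSH lookup on the low-dimensional vectors, then re-ranking
# candidates with the preserved similarity w_p^T w_q + xi_p^T xi_q.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)
k, n_bits = 200, 16
hyperplanes = rng.normal(size=(n_bits, k))

def lsh_key(w):
    return tuple((hyperplanes @ w > 0).astype(int))  # 16-bit signature

# Offline: bucket each document by the signature of its vector w.
docs = {i: (rng.normal(size=k), rng.normal(size=10)) for i in range(10_000)}
index = defaultdict(list)
for doc_id, (w, xi) in docs.items():
    index[lsh_key(w)].append(doc_id)

# Online: hash the query, fetch its bucket, re-rank by full similarity.
wq, xiq = rng.normal(size=k), rng.normal(size=10)
candidates = index[lsh_key(wq)]
ranked = sorted(candidates,
                key=lambda i: docs[i][0] @ wq + docs[i][1] @ xiq,
                reverse=True)
```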

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics
• Structure: who replies to whom
• Optimize them together: model the semantics and the structure jointly

Reply Reconstruction

Similarity components:
• Document similarity
• Topic similarity
• Structure similarity

Baselines:
• NP: reply to Nearest Post
• RR: reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; project documents to topic space
• SWB: Special Words Topic Model with Background distribution; project documents to topic and junk-topic space

Evaluation

Method | Slashdot (All Posts) | Slashdot (Good Posts) | Apple (All Posts) | Apple (Good Posts)
NP     | 0.021 | 0.012 | 0.289 | 0.239
RR     | 0.183 | 0.319 | 0.269 | 0.474
DS     | 0.463 | 0.643 | 0.409 | 0.628
LDA    | 0.465 | 0.644 | 0.410 | 0.648
SWB    | 0.463 | 0.644 | 0.410 | 0.641
SMSS   | 0.524 | 0.737 | 0.517 | 0.772

Expert finding

Pipeline: reply reconstruction → network construction → expert finding

• Methods: HITS, PageRank, …

Baselines

• LM: "Formal Models for Expert Finding in Enterprise Corpora", SIGIR '06; achieves stable performance in the expert-finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: "Personalized Recommendation Driven by Information Flow", SIGIR '06; finds the most influential nodes

Evaluation

• Bayesian estimate (ori: original network; rec: reconstructed network)

Method         | MRR   | MAP   | P@10
LM             | 0.821 | 0.698 | 0.800
EABIF (ori)    | 0.674 | 0.362 | 0.243
EABIF (rec)    | 0.742 | 0.318 | 0.281
PageRank (ori) | 0.675 | 0.377 | 0.263
PageRank (rec) | 0.743 | 0.321 | 0.266
HITS (ori)     | 0.906 | 0.832 | 0.900
HITS (rec)     | 0.938 | 0.822 | 0.906

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: a good practice for learning matrices
– Graphical models: a good practice for learning probability

• A graphical model is a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 20: An Introduction To Matrix Decomposition and Graphical Model

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 21: An Introduction To Matrix Decomposition and Graphical Model

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

(Figure: the same LDA graphical model as above.)

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information.

(Figure: plate diagram – for each of the N patches in each of M images, mixing proportions π generate topics z, which generate patch descriptors x, with parameters θ and β.)

Codebook
• 174 local image patches
• Detection: evenly sampled grid; random sampling; saliency detector; Lowe's DoG detector
• Representation: normalized 11×11 gray values; 128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.

• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA

• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
– Need to access 1000 inverted lists.
– The intersection of 1000 inverted lists may be empty.
– The union of 1000 inverted lists may be the whole corpus.

• Dimension reduction:

Before (Dim = 1 million):
      Term1  Term2  Term3  Term4  …  TermN
Img1    1      2      0      0   …    2

– Topic Projection →

After (Dim = 200):
      f1   f2   …  fM
Img1  0.2  0.1  …  0.03

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

$$p = Xw + \varepsilon$$

(Figure: an image = a dense low-dimensional feature vector + a few residual words (~10 words).)

Orthogonal Decomposition

$$p = Xw + \varepsilon = [x_1, x_2, \dots, x_k]\, w + \varepsilon$$

With orthonormal base vectors, $w = X^{T} p$ and $\varepsilon = p - X X^{T} p$.

– x₁, …, x_k: base vectors
– w: low-dimensional representation
– ε: residual


$$p^{T} q = (X w_p + \varepsilon_p)^{T} (X w_q + \varepsilon_q) = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q$$

since the residuals are orthogonal to the span of the base vectors $x_1, x_2, \dots, x_k$.
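A toy numerical sketch of the decomposition and the similarity identity, with a random orthonormal X standing in for the learned projection matrix; the sizes and names are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

V, k = 2000, 50  # toy vocabulary size and reduced dimension
X, _ = np.linalg.qr(rng.standard_normal((V, k)))  # orthonormal basis

def decompose(p, n_residual_words=10):
    """Split a TF-IDF vector into low-dim coordinates + sparse residual."""
    w = X.T @ p                     # low-dimensional representation
    eps = p - X @ w                 # residual error
    keep = np.argsort(np.abs(eps))[-n_residual_words:]   # "a few words"
    sparse_eps = np.zeros_like(eps)
    sparse_eps[keep] = eps[keep]
    return w, sparse_eps

p, q = rng.random(V), rng.random(V)
wp, ep = decompose(p)
wq, eq = decompose(q)
# With orthonormal X, p^T q = w_p^T w_q + eps_p^T eps_q holds exactly
# when the full residual is kept; after sparsification it is approximate.
print(p @ q, wp @ wq + ep @ eq)
```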

A Probabilistic Implementation

x is a switch variable: it controls whether a word is generated from
• a topic-specific distribution,
• a document-specific distribution, or
• a background distribution.

$$p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) \;+\; p(x{=}1 \mid d)\, p'(w \mid d) \;+\; p(x{=}2 \mid d)\, p''(w)$$

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
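The reconstructed mixture above, as a small sketch; the argument names are illustrative, and the distributions would come from a trained model:

```python
import numpy as np

def word_prob(w, p_x, p_w_given_z, p_z_given_d, p_w_given_d, p_w_bg):
    """p(w|d) under the three-way switch-variable model.

    p_x:          length-3 switch prior for document d, p(x|d).
    p_w_given_z:  K x V topic-specific word distributions.
    p_z_given_d:  length-K topic mixture of document d.
    p_w_given_d:  length-V document-specific distribution.
    p_w_bg:       length-V background distribution.
    """
    topic_term = p_w_given_z[:, w] @ p_z_given_d   # sum_k p(w|z=k) p(z=k|d)
    return p_x[0] * topic_term + p_x[1] * p_w_given_d[w] + p_x[2] * p_w_bg[w]
```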

Search (Online)

(Figure: online pipeline – the query is projected to a low-dimensional vector plus a few residual words; an LSH index over document signatures (DS1, DS2, …) returns candidates such as Doc 300 and Doc 401, which are then re-ranked among the corpus documents (Doc 1 … Doc N) using the residual words and document metadata.)

Index: 10M images, 46GB. Search speed: < 100ms.
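One plausible shape for the index stage – a generic random-hyperplane LSH sketch, not the paper's exact index; the class and parameter names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

class RandomHyperplaneLSH:
    """Minimal random-hyperplane LSH over low-dimensional vectors."""

    def __init__(self, dim, n_bits=16):
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = {}

    def _key(self, w):
        # Sign pattern of the projections = the hash bucket key.
        return tuple((self.planes @ w > 0).astype(int))

    def add(self, doc_id, w):
        self.buckets.setdefault(self._key(w), []).append(doc_id)

    def candidates(self, w):
        return self.buckets.get(self._key(w), [])

index = RandomHyperplaneLSH(dim=200)
for doc_id in range(1000):
    index.add(doc_id, rng.standard_normal(200))
# Candidates would then be re-ranked with the residual words.
print(index.candidates(rng.standard_normal(200)))
```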

Search Examples – (Figures: query images and top retrieved results.)

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & structure

• Semantic: topics.
• Structure: who replies to whom.
• Optimize them together: model semantics and model structure jointly.

Reply reconstruction

Candidate reply links are scored by combining (see the sketch below):
• Document similarity
• Topic similarity
• Structure similarity
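How the three similarities might be combined into one score for a candidate (post, parent) pair – the linear weighting here is an illustrative assumption, not the paper's exact formulation:

```python
import numpy as np

def reply_score(doc_sim, topic_sim, struct_sim, weights=(1/3, 1/3, 1/3)):
    """Score a candidate reply link by mixing the three similarities
    named on the slide; equal weights are a placeholder assumption."""
    return float(np.dot(weights, [doc_sim, topic_sim, struct_sim]))
```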

Baselines:
• NP: reply to nearest post
• RR: reply to root
• DS: document similarity
• LDA: Latent Dirichlet Allocation; project documents to topic space
• SWB: Special Words Topic Model with Background distribution; project documents to topic and junk-topic space

Evaluation

method | Slashdot, All Posts | Slashdot, Good Posts | Apple, All Posts | Apple, Good Posts
NP     | 0.021 | 0.012 | 0.289 | 0.239
RR     | 0.183 | 0.319 | 0.269 | 0.474
DS     | 0.463 | 0.643 | 0.409 | 0.628
LDA    | 0.465 | 0.644 | 0.410 | 0.648
SWB    | 0.463 | 0.644 | 0.410 | 0.641
SMSS   | 0.524 | 0.737 | 0.517 | 0.772

Expert finding

Pipeline: reply reconstruction → network construction → expert finding.

Methods: HITS, PageRank, … (a power-iteration sketch of HITS follows)
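A minimal power-iteration HITS on a reply network, as named in the methods list; encoding the network as L[i, j] = 1 when user i replies to user j is an assumption about how the constructed graph is stored:

```python
import numpy as np

def hits(L, n_iter=100):
    """Power iteration for HITS hub/authority scores on adjacency L."""
    n = L.shape[0]
    h = np.ones(n)
    for _ in range(n_iter):
        a = L.T @ h                    # authorities: pointed to by good hubs
        a /= np.linalg.norm(a)
        h = L @ a                      # hubs: point to good authorities
        h /= np.linalg.norm(h)
    return h, a

# Toy reply graph: users 0 and 1 both reply to user 2.
L = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]], float)
hubs, auths = hits(L)
print(auths.argmax())  # the most authoritative (expert-like) node
```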

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert-finding task using a language model.
• PageRank: benchmark nodal ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node.

Evaluation

• Bayesian estimate

Method        | MRR   | MAP   | P@10
LM            | 0.821 | 0.698 | 0.800
EABIF(ori)    | 0.674 | 0.362 | 0.243
EABIF(rec)    | 0.742 | 0.318 | 0.281
PageRank(ori) | 0.675 | 0.377 | 0.263
PageRank(rec) | 0.743 | 0.321 | 0.266
HITS(ori)     | 0.906 | 0.832 | 0.900
HITS(rec)     | 0.938 | 0.822 | 0.906
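The reported metrics, as small reference implementations of their standard definitions (not code from the paper; `mrr` assumes every ranking contains at least one relevant item):

```python
def mrr(ranked_lists):
    """Mean reciprocal rank; each item is (ranking, set of relevant ids)."""
    total = 0.0
    for ranking, relevant in ranked_lists:
        rank = next(i for i, d in enumerate(ranking, 1) if d in relevant)
        total += 1.0 / rank
    return total / len(ranked_lists)

def precision_at_k(ranking, relevant, k=10):
    return sum(d in relevant for d in ranking[:k]) / k

def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, 1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / max(len(relevant), 1)

# Toy check: the single relevant doc ranked first gives MRR = 1.0.
print(mrr([(["e1", "e2"], {"e1"})]))
```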

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision.
– Matrix decomposition – a good practice to learn matrices.
– Graphical model – a good practice to learn probability.

• A graphical model is a good tool to analyze problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 22: An Introduction To Matrix Decomposition and Graphical Model

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 23: An Introduction To Matrix Decomposition and Graphical Model

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.

• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA

• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus

• Dimension reduction via topic projection:

Original term space (dim = 1 million):
        Term1  Term2  Term3  Term4  …  TermN
  Img1    1      2      0      0    …    2

        ↓ topic projection

Projected topic space (dim = 200):
        f1    f2    …   fM
  Img1  0.2   0.1   …  0.03

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

p ≈ Xw + ε
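
A minimal sketch of this idea, assuming the columns of X are orthonormal and keeping only the ten largest residual coordinates (function and variable names are mine, matching the "+ a few words" picture below):

```python
import numpy as np

def decompose(p, X, n_residual_words=10):
    """Split a TF-IDF vector into a low-dim code plus a sparse residual.

    p: (V,) TF-IDF vector; X: (V, k) projection matrix, assumed to have
    orthonormal columns so that the code is a simple projection.
    """
    w = X.T @ p                           # low-dimensional feature vector
    residual = p - X @ w                  # what the subspace cannot express
    top = np.argsort(-np.abs(residual))[:n_residual_words]
    sparse_residual = {int(i): float(residual[i]) for i in top}
    return w, sparse_residual             # "an image = w + a few words"
```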

An image = [figure: a sparse low-dimensional feature vector, values in 0–1, shown as a bar chart] + a few words (10 words)

Orthogonal Decomposition

p = Xw + ε = [x1, x2, …, xk] (w1, w2, …, wk)ᵀ + ε

where x1 … xk (the columns of X) are the base vectors, w is the low-dimensional representation, and ε is the residual.

An image = [figure: a sparse low-dimensional feature vector, values in 0–1, shown as a bar chart] + a few words (10 words)

pᵀq = (Xwp + εp)ᵀ(Xwq + εq) = wpᵀwq + εpᵀεq

since the base vectors X1, X2, X3, …, Xk are orthonormal and orthogonal to the residuals.
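
A quick numeric check of this identity, with a random orthonormal basis standing in for the learned projection:

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 50
X, _ = np.linalg.qr(rng.standard_normal((V, k)))   # random orthonormal basis
p, q = rng.standard_normal(V), rng.standard_normal(V)
wp, wq = X.T @ p, X.T @ q                          # low-dimensional codes
ep, eq = p - X @ wp, q - X @ wq                    # residuals
# p.T q == wp.T wq + ep.T eq when the basis is orthonormal
assert np.allclose(p @ q, wp @ wq + ep @ eq)
```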

A Probabilistic Implementation

x is a switch variable: it controls whether a word is generated from

• a topic-specific distribution

• a document-specific distribution

• a background distribution

p(w|d) = p(x=0|d) Σk=1..K p(w|z=k) p(z=k|d)
       + p(x=1|d) p(w|d)
       + p(x=2|d) p(w)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
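
A minimal sketch of the resulting word likelihood (variable names are mine; the full model also has to learn each of these distributions):

```python
import numpy as np

def word_prob(w, switch, topic_word, doc_topic, doc_word, bg_word):
    """p(w|d) under the switch-variable model (illustrative sketch).

    switch:     [p(x=0|d), p(x=1|d), p(x=2|d)]
    topic_word: (K, V) array, p(w|z=k)
    doc_topic:  (K,) array, p(z=k|d)
    doc_word:   (V,) document-specific distribution
    bg_word:    (V,) corpus-wide background distribution
    """
    topic_term = doc_topic @ topic_word[:, w]   # sum_k p(w|z=k) p(z=k|d)
    return (switch[0] * topic_term
            + switch[1] * doc_word[w]
            + switch[2] * bg_word[w])
```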

Search (Online)

[Figure: the query's low-dimensional feature w is hashed into an LSH index; each bucket stores document signatures (DS1, DS2, …), and the lookup returns a candidate list such as Doc 300, Doc 401, …]

A query = [figure: a sparse low-dimensional feature vector shown as a bar chart] + a few words

[Figure: re-ranking – the candidate documents from the LSH lookup (e.g., Doc 300, Doc 401) are re-ranked to produce the final result list over Doc 1 … Doc N]

Index: 10M images, 46 GB; search speed < 100 ms (document metadata is stored alongside the index)
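
A toy sketch of the two-stage pipeline, assuming random-hyperplane LSH with a single hash table and residuals stored as sparse dicts (the real index layout is more elaborate):

```python
import numpy as np
from collections import defaultdict

class TwoStageIndex:
    """Two-stage retrieval sketch: LSH lookup on w, then residual re-ranking."""

    def __init__(self, n_bits, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))  # random hyperplanes
        self.buckets = defaultdict(list)
        self.docs = {}                          # doc_id -> (w, residual)

    def _key(self, w):
        return tuple((self.planes @ w > 0).astype(int).tolist())

    def add(self, doc_id, w, residual):
        self.buckets[self._key(w)].append(doc_id)
        self.docs[doc_id] = (w, residual)

    def search(self, w, residual, topn=10):
        candidates = self.buckets[self._key(w)]           # stage 1: LSH lookup
        def score(doc_id):                                # stage 2: re-rank
            wd, rd = self.docs[doc_id]
            return w @ wd + sum(residual.get(t, 0.0) * v for t, v in rd.items())
        return sorted(candidates, key=score, reverse=True)[:topn]
```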

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics
• Structure: who replies to whom
• Optimize them together: model the semantics and the structure jointly

Reply Reconstruction

The reply structure is reconstructed by combining three similarity signals (a sketch of the combination follows this list):
• Document similarity
• Topic similarity
• Structure similarity
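
A toy scorer in this spirit; the equal mixing weights and cosine similarity are illustrative assumptions, not the paper's learned sparse-coding model:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def predict_parent(doc_vecs, topic_vecs, struct_prior, i, weights=(1/3, 1/3, 1/3)):
    """Pick which earlier post the i-th post replies to (assumes i >= 1).

    doc_vecs:     per-post term vectors (document similarity)
    topic_vecs:   per-post topic-space vectors (topic similarity)
    struct_prior: struct_prior[j] = prior that post j attracts replies
    weights:      assumed mixing weights for the three signals
    """
    wd, wt, ws = weights
    scores = [wd * cosine(doc_vecs[i], doc_vecs[j])
              + wt * cosine(topic_vecs[i], topic_vecs[j])
              + ws * struct_prior[j]
              for j in range(i)]
    return int(np.argmax(scores))
```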

Baselines:
• NP: reply to nearest post
• RR: reply to root
• DS: document similarity
• LDA: Latent Dirichlet Allocation – project documents to topic space
• SWB: Special Words topic model with Background distribution – project documents to topic and junk-topic spaces

Evaluation

Method   Slashdot              Apple
         All Posts  Good Posts  All Posts  Good Posts
NP       0.021      0.012       0.289      0.239
RR       0.183      0.319       0.269      0.474
DS       0.463      0.643       0.409      0.628
LDA      0.465      0.644       0.410      0.648
SWB      0.463      0.644       0.410      0.641
SMSS     0.524      0.737       0.517      0.772

Expert Finding

Pipeline: reply reconstruction → network construction → expert finding

Methods: HITS, PageRank, … (a HITS sketch follows)
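
Since HITS is one of the main ranking methods here, a minimal power-iteration sketch (L is the reply-network adjacency matrix; this is the standard algorithm, not the paper's code):

```python
import numpy as np

def hits(L, n_iter=100):
    """Power iteration for HITS; L[i, j] = 1 if node i links (replies) to node j."""
    n = L.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(n_iter):
        a = L.T @ h                    # good authorities: pointed to by good hubs
        a /= np.linalg.norm(a)
        h = L @ a                      # good hubs: point to good authorities
        h /= np.linalg.norm(h)
    return h, a
```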

Baselines:
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06 – achieves stable performance in the expert-finding task using a language model
• PageRank: benchmark node-ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06 – finds the most influential node

Evaluation

• Bayesian estimate

Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition – a good practice for learning matrices
– Graphical model – a good practice for learning probability

• Graphical models are a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

Page 25: An Introduction To Matrix Decomposition and Graphical Model

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009


Semantic & structure


Semantic: Topics

Structure: Who replies to whom

Optimize them together

Model semantics

Model structure


Reply reconstruction


Document Similarity

Topic Similarity

Structure Similarity
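One plausible way to combine the three similarities when scoring candidate parents for a post; the cosine measure and the weights here are assumptions for illustration, not the paper's sparse-coding objective:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def reply_score(post, cand, alpha=0.4, beta=0.4, gamma=0.2):
    return (alpha * cosine(post["tfidf"], cand["tfidf"])   # document similarity
          + beta * cosine(post["topic"], cand["topic"])    # topic similarity
          + gamma * cand["structure_prior"])               # structure similarity

# Predicted parent = the earlier post in the thread with the highest reply_score.
```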

Baselines:
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; projects documents to topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk-topic space


Evaluation

Method   Slashdot              Apple
         All Posts  Good Posts  All Posts  Good Posts
NP        0.021      0.012      0.289      0.239
RR        0.183      0.319      0.269      0.474
DS        0.463      0.643      0.409      0.628
LDA       0.465      0.644      0.410      0.648
SWB       0.463      0.644      0.410      0.641
SMSS      0.524      0.737      0.517      0.772


Expert finding

Pipeline: Reply reconstruction → Network construction → Expert finding

Methods: HITS, PageRank, …
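A compact HITS power iteration over the reconstructed reply network, as one of the listed methods; the convention L[i, j] = 1 when user i replies to user j is an assumption for illustration:

```python
import numpy as np

def hits(L, iters=50):
    h = np.ones(L.shape[0])
    a = np.ones(L.shape[0])
    for _ in range(iters):
        a = L.T @ h                       # authority: replied to by good hubs
        a /= np.linalg.norm(a)
        h = L @ a                         # hub: replies to good authorities
        h /= np.linalg.norm(h)
    return h, a

L = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)    # tiny reply graph
hubs, auths = hits(L)                     # experts ~ high authority scores
```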


Baselines

LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance in the expert finding task using a language model.

PageRank: benchmark nodal ranking method.

HITS: finds hub nodes and authority nodes.

EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential node.

Evaluation


• Bayesian estimate

Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: a good practice for learning matrix algebra
– Graphical model: a good practice for learning probability

• A graphical model is a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition



bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 27: An Introduction To Matrix Decomposition and Graphical Model

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition?
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principal Component Analysis
  • Definition – Eigenvalue & Eigenvector
  • Definition – Principal Component Analysis
  • Principal Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property: Data Decomposition
  • Face Recognition – Eigenface
  • Slide 14
  • Slide 15
  • PageRank – Power Iteration
  • Column-Stochastic & Irreducible
  • Iterative PageRank Calculation
  • Convergence of the Power Iteration
  • Singular Value Decomposition
  • SVD – Definition
  • Singular Values And Singular Vectors
  • Matrix Approximation
  • SVD and PCA
  • Example – LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF – Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning?
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID – Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensen's Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models: Plate Notation
  • pLSA – Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA – Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA – Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA – Latent Dirichlet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variational Inference (2)
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Categories
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models?
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea: Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Discussions
  • Semantic & Structure
  • Optimize Them Together
  • Reply Reconstruction
  • Baselines
  • Evaluation
  • Expert Finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 28: An Introduction To Matrix Decomposition and Graphical Model

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 29: An Introduction To Matrix Decomposition and Graphical Model

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 30: An Introduction To Matrix Decomposition and Graphical Model

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data.

• Inversely, given the observed data and a model of interest, the likelihood function is defined as

L(θ) = f_θ(x) = p(x|θ)

• That is, the likelihood function L(θ) shows that some parameter values are more likely than others to have produced the data.

Maximum Likelihood (ML)

• Maximum likelihood finds the best model parameters, i.e., those that make the data "most likely" to have been generated from this model.

• Suppose we are given n data samples (x_1, x_2, …, x_n); then
L(θ) = p(x_1, x_2, …, x_n | θ)

• Maximum likelihood finds the θ that maximizes L(θ):
θ_ML = argmax_θ L(θ)

• Predictive distribution: p(x_new | θ_ML)

IID – Independent, Identically Distributed

• IID means the samples are independent and identically distributed:
p(x_1, x_2, …, x_n | θ) = Π_{i=1}^n p(x_i | θ)

• The problem is considerably simplified:
L(θ) = Π_{i=1}^n p(x_i | θ)

• Usually the log likelihood is used:
ℓ(θ) = log L(θ) = Σ_{i=1}^n log p(x_i | θ)
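As a toy illustration (my own, not from the slides), a sketch that recovers the ML estimates of a Gaussian's mean and standard deviation by numerically maximizing the i.i.d. log likelihood; for a Gaussian these coincide with the closed-form sample statistics:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # observed i.i.d. samples

def neg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)   # parameterize by log(sigma) to keep sigma > 0
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))   # numerical ML estimates
print(x.mean(), x.std())            # closed-form ML estimates (match)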

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Introduction to Machine Learning, Lectures 1–2 slides)

• Gregor Heinrich. Parameter estimation for text analysis. Technical note, 2005–2008.

EXPECTATION MAXIMIZATION

Why We Need EM

• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models.

• Why do we need latent variables?

– To describe complex models: e.g., the Gaussian Mixture Model.

– To discover the intrinsic structure inside a data set: e.g., topic models such as pLSA and LDA.

More General

• Data set: D = {x^(1), …, x^(N)}
• Likelihood: p(D|θ) = ∫ p(D, X|θ) dX, where X are the latent variables.

• Goal: learn maximum likelihood (ML) parameter values.

• The maximum likelihood procedure finds parameters θ such that θ* = argmax_θ p(D|θ).

• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated and hard-to-optimize function.

The Expectation Maximization (EM) Algorithm

• The EM algorithm finds a (local) maximum of a latent variable model's likelihood. It starts from arbitrary values of the parameters and iterates two steps:

• E step: fill in the values of the latent variables according to their posterior given the data.

• M step: maximize the likelihood as if the latent variables were not hidden.

• It decomposes difficult problems into a series of tractable steps.

Jensen's Inequality

For a concave function f (such as log), f(E[x]) ≥ E[f(x)]; this is the inequality used to lower-bound the log likelihood below.

Lower Bounding the Log Likelihood
• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ.
• Goal: maximize the log likelihood (i.e., ML learning) w.r.t. θ:
L(θ) = log p(D|θ) = log ∫ p(X, D|θ) dX

• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood, using Jensen's inequality:
L(θ) = log ∫ q(X) [p(X, D|θ)/q(X)] dX ≥ ∫ q(X) log [p(X, D|θ)/q(X)] dX = F(q, θ)

• Equivalently, F(q, θ) = ∫ q(X) log p(X, D|θ) dX + H[q], where H[q] is the entropy of q(X).

The E and M Steps of EM

• The lower bound on the log likelihood is given by
F(q, θ) = ∫ q(X) log p(X, D|θ) dX + H[q]

• EM alternates between:
• E step: optimize F(q, θ) w.r.t. the distribution over hidden variables, holding the parameters fixed:
q^(k)(X) = argmax_{q(X)} F(q(X), θ^(k−1))

• M step: maximize F(q, θ) w.r.t. the parameters, holding the hidden distribution fixed:
θ^(k) = argmax_θ F(q^(k)(X), θ)

The E Step

• E step: for fixed θ,
F(q, θ) = ∫ q(X) log [p(X, D|θ)/q(X)] dX = L(θ) − KL(q(X) ‖ p(X|D, θ))

• The second term is the Kullback-Leibler divergence.
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when KL(q(X) ‖ p(X|D, θ)) = 0.
• So the E step simply sets q^(k)(X) = p(X|D, θ^(k−1)), making the bound tight.

The M Step

• M step: maximize F(q, θ) w.r.t. the parameters, holding the hidden distribution q fixed:
θ^(k) = argmax_θ F(q^(k)(X), θ) = argmax_θ ∫ q^(k)(X) log p(X, D|θ) dX

• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ.

• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically.

EM Never Decreases the Likelihood

• The E and M steps together never decrease the log likelihood:
L(θ^(k−1)) = F(q^(k), θ^(k−1)) ≤ F(q^(k), θ^(k)) ≤ L(θ^(k))

• The E step brings F(q, θ) up to the likelihood L(θ) (the bound becomes tight).
• The M step then increases F(q, θ) w.r.t. θ (lifting the bound).
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL.
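A compact sketch (my own, not from the slides) of EM for a two-component 1-D Gaussian mixture; the printed log likelihood is non-decreasing across iterations, as this slide argues:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

# Initial parameters: mixing weights, means, standard deviations
pi, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for it in range(30):
    # E step: posterior responsibility of each component for each point
    dens = pi * norm.pdf(x[:, None], mu, sd)          # shape (N, 2)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M step: weighted ML updates, as if the assignments were observed
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    print(it, np.log(dens.sum(axis=1)).sum())          # never decreases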

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Unsupervised Learning, Lecture 5 slides)

• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.

WHY DO WE NEED GRAPHICAL MODELS

Why Do We Need Graphical Models

• Cons:
– Graphical models get complex quickly, even with a few cycles.
– We have to make many assumptions.

• Pros:
– We do need probability to explain our world, but the joint probability is hard to compute directly.
– Graphical models can help us analyze and understand our problems.
– Graphs are an intuitive way of representing and visualizing the relationships between many variables.
– With a graphical model, we can decouple a joint probability into conditional probabilities, which are usually easier to work with.

Directed Acyclic Graphical Models (Bayesian Networks)

• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution:

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)

• In general:
p(x_1, …, x_N) = Π_{i=1}^N p(x_i | x_pa(i))

• where pa(i) denotes the parents of node i.
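A small sketch (my own illustration, with made-up binary conditional probability tables) showing how this factorization turns one joint table into a product of small conditional ones:

import itertools

# Hypothetical CPTs for binary variables, matching
# p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
pA = {0: 0.6, 1: 0.4}
pB = {0: 0.7, 1: 0.3}
pC = {(a, b): {0: 0.9 - 0.2 * a - 0.3 * b, 1: 0.1 + 0.2 * a + 0.3 * b}
      for a in (0, 1) for b in (0, 1)}
pD = {(b, c): {0: 0.5 + 0.1 * b - 0.2 * c, 1: 0.5 - 0.1 * b + 0.2 * c}
      for b in (0, 1) for c in (0, 1)}
pE = {(c, d): {0: 0.8 - 0.3 * c, 1: 0.2 + 0.3 * c}
      for c in (0, 1) for d in (0, 1)}

def joint(a, b, c, d, e):
    return pA[a] * pB[b] * pC[(a, b)][c] * pD[(b, c)][d] * pE[(c, d)][e]

# The factored joint still sums to 1 over all 2^5 configurations
print(sum(joint(*v) for v in itertools.product((0, 1), repeat=5)))  # ~1.0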

Directed Graphs for Statistical Models: Plate Notation

bull A data set of N points generated from a Gaussian

PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

• For natural language queries, simple term matching does not work effectively:
– Ambiguous terms.
– The same queries vary due to personal styles.

• Latent semantic indexing:
– Creates a "latent semantic space" (hidden meaning).

• LSI puts documents together, even if they don't have common words, if the docs share frequently co-occurring terms.

• Disadvantages:
– The statistical foundation is missing.

pLSA – Probabilistic Latent Semantic Analysis

• Automated document indexing and information retrieval.
• Identification of latent classes using an Expectation-Maximization (EM) algorithm.
• Shown to solve:

– Polysemy
• "Java" could mean "coffee" and also the programming language Java.
• "Cricket" is a game and also an insect.

– Synonymy
• "computer", "pc", "desktop" could all mean the same thing.

• Has a better statistical foundation than LSA.

pLSA

[Plate diagram: for each of the M documents d, each of its N_d words w is generated by first drawing a latent topic z conditioned on d. Unrolled for one document: d → (z_1, w_1), (z_2, w_2), (z_3, w_3), …, (z_N, w_N).]

z_1, …, z_N are hidden variables, z_i ∈ [1, K], where K is the number of latent topics.

pLSA

[The same model unrolled across documents: d_1 → (z_1, w_1), …, (z_N1, w_N1); d_2 → (z_1, w_1), …, (z_N2, w_N2); …; d_M → (z_1, w_1), …, (z_Nm, w_Nm).]

p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents.

Likelihood

Joint Probability vs Likelihood

• Joint probability:
p(d, w, z) = p(d) p(z|d) p(w|z)

• Likelihood (only over the observed variables):
p(d, w) = p(d) Σ_z p(z|d) p(w|z)

• p(d) is assumed to be uniform.

Document Decomposition

• Each document can be decomposed as:
p(w|d) = Σ_z p(w|z) p(z|d)

• This is similar to matrix decomposition, if we consider each discrete distribution as a vector:

p(w|d) = Z_{V×K} p(z|d)

where each column of Z_{V×K} is a topic distribution p(w|z).

• With many documents, we hope to find the latent topics as a common basis.

pLSA – Objective Function

• pLSA tries to maximize the log likelihood:
L = Σ_d Σ_w n(d, w) log p(d, w)

• Due to the summation over z inside the log, we have to resort to EM.

EM Steps

• E-Step:
– The expectation of the likelihood function is calculated with the current parameter values.

• M-Step:
– Update the parameters with the calculated posterior probabilities.
– Find the parameters that maximize the likelihood function.

Lower Bounding the Log Likelihood

EM Steps

• The E-Step:
p(z|d, w) = p(z|d) p(w|z) / Σ_{z'} p(z'|d) p(w|z')

• The M-Step:
p(w|z) ∝ Σ_d n(d, w) p(z|d, w),   p(z|d) ∝ Σ_w n(d, w) p(z|d, w)
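A minimal NumPy sketch of these updates (my own illustration; `counts` is a hypothetical document-word count matrix):

import numpy as np

def plsa(counts, K, iters=100, seed=0):
    """counts: (D, V) document-word counts. Returns p(z|d), p(w|z)."""
    D, V = counts.shape
    rng = np.random.default_rng(seed)
    p_z_d = rng.random((D, K))
    p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((K, V))
    p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(iters):
        # E step: p(z|d,w) ∝ p(z|d) p(w|z), stored as a (D, V, K) array
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(2, keepdims=True) + 1e-12
        weighted = counts[:, :, None] * post           # n(d,w) p(z|d,w)
        # M step
        p_w_z = weighted.sum(0).T                      # (K, V)
        p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = weighted.sum(1)                        # (D, K)
        p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

counts = np.random.default_rng(1).integers(0, 5, size=(8, 50))
p_z_d, p_w_z = plsa(counts, K=3)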

Latent Subspace

pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction:

– In LSA, by keeping only K singular values.
– In pLSA, by having K aspects.

• Comparison to SVD:
– U matrix: related to p(z|d) (document to aspect).
– V matrix: related to p(w|z) (aspect to term).
– Σ matrix: related to p(z) (aspect strength).

pLSA vs LSA
• The main difference is the way the approximation is done.

• pLSA generates a model (the aspect model) and maximizes its predictive power.

• Selecting the proper value of K is heuristic in LSA.

• Model selection in statistics can determine the optimal K in pLSA.

Applications

• Text mining: topic discovery.

• Scene classification.

Text Mining

Scene Classification

Classification Result

Reference

• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.

• A. Bosch, A. Zisserman and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).

• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level: each doc has its own topic mixture proportion.

• The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA

• There is no constraint on the distributions p(z|d_i).

• This easily leads to serious over-fitting problems.

[Figure: the pLSA model unrolled across documents d_1, d_2, …, d_m; each document carries its own free parameter vector p(z|d_1), p(z|d_2), …, p(z|d_m).]

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.

• Requirements for such a distribution:
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomial parameters.
– Easy to optimize.

• The Dirichlet distribution is one such distribution.

• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution

• Definition:

p(x_1, x_2, …, x_K | α) = (Γ(Σ_{i=1}^K α_i) / Π_{i=1}^K Γ(α_i)) · Π_{i=1}^K x_i^{α_i − 1},
subject to x_i > 0 and Σ_{i=1}^K x_i = 1.

• The density is zero outside this open (K − 1)-dimensional simplex.

Example Dirichlet Distributions (K=3)

• Various parameter values α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6). [Figure: one density plot on the 2-simplex per setting.]

Example Dirichlet Distributions (K=3)

• Equal α_i, different α_0 = Σ_{i=1}^K α_i:
α_0 = 0.1, α_0 = 1, α_0 = 10. [Figure: as α_0 grows, the density shifts from concentrated at the simplex corners to concentrated at the center.]
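A quick NumPy sketch (my own) reproducing the effect of α_0 on samples from a symmetric Dirichlet:

import numpy as np

rng = np.random.default_rng(0)
for alpha0 in (0.1, 1.0, 10.0):
    alpha = np.full(3, alpha0 / 3)            # equal alpha_i summing to alpha0
    samples = rng.dirichlet(alpha, size=5)    # each row is a point on the 2-simplex
    print(f"alpha0={alpha0}:")
    print(samples.round(3))   # small alpha0 -> near the corners; large -> near the center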

The LDA Model

[Graphical model: α → θ → z_n → w_n ← β, unrolled in the figure for several documents as chains z_1, z_2, z_3, z_4 → w_1, w_2, w_3, w_4, all sharing the corpus-level parameters α and β.]

• For each document:
– Choose θ ~ Dirichlet(α).
– For each of the N words w_n:
» Choose a topic z_n ~ Multinomial(θ).
» Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
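A sketch of this generative process (my own illustration; the vocabulary size, topic count, and hyperparameter values are made up):

import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 10, 20                              # topics, vocabulary size, words per doc
alpha = np.full(K, 0.5)                          # Dirichlet hyperparameter
beta = rng.dirichlet(np.full(V, 0.1), size=K)    # K topic-word distributions

def generate_document():
    theta = rng.dirichlet(alpha)                 # theta ~ Dirichlet(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)               # z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])             # w_n ~ p(w | z_n, beta)
        words.append(w)
    return words

print(generate_document())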


Joint Probability

• Given parameters α and β:

p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^N p(z_n|θ) p(w_n|z_n, β)

where p(z_n|θ) is simply θ_i for the unique i such that z_n^i = 1.

Likelihood

• Joint probability:
p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^N p(z_n|θ) p(w_n|z_n, β)

• Marginal distribution of a document:
p(w | α, β) = ∫ p(θ|α) (Π_{n=1}^N Σ_{z_n} p(z_n|θ) p(w_n|z_n, β)) dθ

• Likelihood over all the documents:
L(α, β; D) = Π_{d=1}^M p(w_d | α, β)

Inference

• The likelihood can be computed by summing over the documents.
• Jensen's inequality in EM.

Inference

• In the E-step, we need to compute the posterior distribution of the hidden variables:
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

• Unfortunately, this distribution is intractable to compute in general.

• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and the posterior distributions:
q(θ, z | γ, φ) = q(θ|γ) Π_{n=1}^N q(z_n|φ_n)

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence:
log p(w | α, β) = L(γ, φ; α, β) + KL(q(θ, z | γ, φ) ‖ p(θ, z | w, α, β))

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs EM

• They differ only in the E-step.

• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, we approximate p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) ‖ p(X|D, θ)).

• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.

• Strategy (variational EM):

• Lower bound log p(w | α, β) by a function L(γ, φ; α, β).
• Repeat until convergence:

– E: maximize L with respect to the variational parameters γ and φ.
– M: maximize the bound with respect to the model parameters α and β.

Parameter Estimation

• E-Step (variational inference), repeat until convergence:
φ_{ni} ∝ β_{i,w_n} exp(Ψ(γ_i)),    γ_i = α_i + Σ_{n=1}^N φ_{ni}

• M-Step (parameter estimation):

β_{ij} ∝ Σ_{d=1}^M Σ_{n=1}^{N_d} φ*_{dni} w_{dn}^j

α can be updated using the Newton-Raphson method.
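A sketch of the per-document variational E-step above (my own, following these update equations; `beta` and the token list are hypothetical inputs):

import numpy as np
from scipy.special import digamma

def e_step(words, alpha, beta, iters=50):
    """Variational inference for one document: returns (gamma, phi)."""
    K = beta.shape[0]
    gamma = alpha + len(words) / K                 # q(theta) ~ Dirichlet(gamma)
    for _ in range(iters):
        # phi_{ni} ∝ beta_{i, w_n} * exp(digamma(gamma_i))
        phi = beta[:, words].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)            # gamma_i = alpha_i + sum_n phi_{ni}
    return gamma, phi

K, V = 3, 10
rng = np.random.default_rng(0)
beta = rng.dirichlet(np.full(V, 0.1), size=K)      # K x V topic-word matrix
gamma, phi = e_step(words=[0, 3, 3, 7], alpha=np.full(K, 0.5), beta=beta)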

Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus.

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset: contains 8,000 documents and 15,818 words.

(a) EARN vs. NOT EARN   (b) GRAIN vs. NOT GRAIN

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

[Figure: the LDA graphical model repeated from "The LDA Model": every document's topic proportion is drawn from a single Dirichlet prior, and all documents share the topic-word parameters β.]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information.

[Plate diagram: variables π → z → x with parameters θ and β; the inner plate over (z, x) repeats N_d times per image, and the outer plate repeats M times.]

Codebook
• 174 local image patches.

• Detection: evenly sampled grid, random sampling, saliency detector, or Lowe's DoG detector.

• Representation: normalized 11×11 gray values, or 128-dim SIFT.

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006.
• Correlated Topic Model, NIPS 2005.
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003.
• Nonparametric Bayes pachinko allocation, UAI 2007.
• Supervised LDA, NIPS 2007.
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009.
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.

• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA

• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
– Need to access 1000 inverted lists.
– The intersection of 1000 inverted lists may be empty.
– The union of 1000 inverted lists may be the whole corpus.

• Dimension reduction:

Term vector (Dim = 1 million):   Img1 = [Term1: 1, Term2: 2, Term3: 0, Term4: 0, …, TermN: 2]
        → Topic Projection →
Topic vector (Dim = 200):        Img1 = [f1: 0.2, f2: 0.1, …, fM: 0.03]

Key Idea: Dimension Reduction + Residual Error Preservation

p = Xw + ε

• p: the original TF-IDF vector in vocabulary space.
• X: the projection matrix for dimension reduction.
• w: the low-dimensional feature vector.
• ε: the residual error.

[Figure] An image = a low-dimensional topic vector + a few residual words (10 words).

Orthogonal Decomposition

p = Xw + ε = [x_1, x_2, …, x_k] w + ε

• X = [x_1, …, x_k]: the base vectors.
• w: the low-dimensional representation.
• ε: the residual.

[Figure] An image = a low-dimensional topic vector + a few residual words (10 words).

Since the residual is orthogonal to the base vectors x_1, x_2, …, x_k, inner products decompose as

pᵀq = w_pᵀ w_q + ε_pᵀ ε_q
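A small NumPy sketch (my own) of this similarity-preserving decomposition, using an orthonormal basis X so the identity holds exactly:

import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 20                                 # vocabulary size, reduced dimension
X, _ = np.linalg.qr(rng.normal(size=(V, k)))    # orthonormal basis (V x k)

def decompose(p):
    w = X.T @ p           # low-dimensional representation
    eps = p - X @ w       # residual, orthogonal to span(X)
    return w, eps

p, q = rng.normal(size=V), rng.normal(size=V)
wp, ep = decompose(p)
wq, eq = decompose(q)
# p^T q = w_p^T w_q + eps_p^T eps_q
print(np.isclose(p @ q, wp @ wq + ep @ eq))     # True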

A Probabilistic Implementation

x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution,

• a document-specific distribution, or

• a background distribution:

p(w|d) = p(x=0|d) Σ_{k=1}^K p(w|z=k) p(z=k|d) + p(x=1|d) p(w|d, specific) + p(x=2|d) p(w|background)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.

Search (Online)

[Figure: the query's low-dimensional feature is looked up in an LSH index over the documents' low-dimensional features (DS1, DS2, …); the matching buckets return candidate documents such as Doc 300 and Doc 401.]

[Figure] A query = a low-dimensional topic vector + a few residual words. The LSH candidates (among Doc 1, Doc 2, …, Doc N) are re-ranked using the residual words and the document metadata, returning e.g. Doc 300, Doc 401, ….

Index: 10M images, 46GB. Search speed: < 100 ms.

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics.
• Structure: who replies to whom.
• Optimize them together: model the semantics and the structure jointly.

Reply Reconstruction

• Components: document similarity, topic similarity, and structure similarity.

Baselines:
• NP: reply to the nearest post.
• RR: reply to the root.
• DS: document similarity.
• LDA: Latent Dirichlet Allocation; project documents into topic space.
• SWB: Special Words Topic Model with Background distribution; project documents into topic and junk-topic space.

Evaluation

Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding.
• Methods: HITS, PageRank, …

Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance in the expert finding task using a language model.
• PageRank: benchmark nodal ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential nodes.

Evaluation

• Bayesian estimate.

Method         MRR     MAP     P@10
LM             0.821   0.698   0.800
EABIF(ori)     0.674   0.362   0.243
EABIF(rec)     0.742   0.318   0.281
PageRank(ori)  0.675   0.377   0.263
PageRank(rec)  0.743   0.321   0.266
HITS(ori)      0.906   0.832   0.900
HITS(rec)      0.938   0.822   0.906

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision:
– Matrix decomposition: a good practice for learning matrices.
– Graphical models: a good practice for learning probability.

• A graphical model is a good tool to analyze problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 31: An Introduction To Matrix Decomposition and Graphical Model

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 32: An Introduction To Matrix Decomposition and Graphical Model

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs. LSA
• The main difference is the way the approximation is done

• pLSA generates a model (aspect model) and maximizes its predictive power

• Selecting the proper value of K is heuristic in LSA

• Model selection in statistics can determine the optimal K in pLSA

Applications

• Text mining: topic discovery

• Scene classification

Text Mining

Scene Classification

Classification Result

Reference

• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.

• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).

• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level: each doc has its own topic mixture proportion

• The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

• There is no constraint on the distributions p(z|d_i)

• This easily leads to serious over-fitting problems

[Figure: each document d_1, d_2, …, d_m keeps its own free topic proportions p(z|d_1), p(z|d_2), …, p(z|d_m)]

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution

• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomial parameters
– Easy to optimize

• The Dirichlet distribution is one such distribution

• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex

Dirichlet Distribution

• Definition:

p(x_1, x_2, …, x_K | α) = ( Γ(Σ_{i=1}^{K} α_i) / ∏_{i=1}^{K} Γ(α_i) ) ∏_{i=1}^{K} x_i^{α_i − 1},   where x_i ≥ 0 and Σ_{i=1}^{K} x_i = 1

• The density is zero outside this open (K − 1)-dimensional simplex
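A quick numerical illustration of the simplex constraint (my example, using NumPy's built-in Dirichlet sampler):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([6.0, 2.0, 2.0])         # one of the parameter settings below
samples = rng.dirichlet(alpha, size=5)    # each row is a point on the 2-simplex

print(samples)
print(samples.sum(axis=1))   # every sample is non-negative and sums to 1
```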

Example Dirichlet Distributions (K=3)

• Various parameter settings: α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)

Example Dirichlet Distributions (K=3)

• Equal α_i, different α_0 = Σ_{i=1}^{K} α_i:

α_0 = 0.1,  α_0 = 1,  α_0 = 10

The LDA Model

[Figure: LDA graphical model, repeated for three documents — per document, θ is drawn from Dirichlet(α); topics z_1, …, z_4 generate words w_1, …, w_4 via β]

• For each document:
• Choose θ ~ Dirichlet(α)
• For each of the N words w_n:
– Choose a topic z_n ~ Multinomial(θ)
– Choose a word w_n from p(w_n|z_n, β), a multinomial probability conditioned on the topic z_n

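The generative story translates almost line for line into code. A minimal sketch (the vocabulary size, α, and β below are invented toy values):

```python
import numpy as np

rng = np.random.default_rng(2)
K, V, N = 3, 10, 8                    # topics, vocabulary size, words per doc
alpha = np.full(K, 0.5)               # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.ones(V), K)   # beta[k] = p(w | z=k), one row per topic

theta = rng.dirichlet(alpha)          # choose theta ~ Dirichlet(alpha)
doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)        # choose topic z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])      # choose word w_n ~ p(w | z_n, beta)
    doc.append((z, w))
print(doc)
```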

Joint Probability

• Given parameters α and β:

p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)

where p(z_n | θ) is simply θ_{z_n}

Likelihood

• Joint probability

• Marginal distribution of a document

• Likelihood over all the documents
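These formulas were images in the transcript; in Blei et al.'s notation, the per-document marginal (integrating out θ and summing out z from the joint above) and the corpus likelihood are:

```latex
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha) \Big( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Big)\, d\theta,
\qquad
\mathcal{L}(\alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta).
```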

Inference

• The likelihood can be computed by summing over each document
• Jensen's inequality in EM

Inference

• In the E-step, we need to compute the posterior distribution of the hidden variables:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

• Unfortunately, this distribution is intractable to compute in general

• We have to resort to a variational approach

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ, φ and minimize the KL divergence between the variational and posterior distributions

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
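Written out, the decomposition the slide describes is:

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\big\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)
```

Since the left-hand side does not depend on (γ, φ), raising the lower bound necessarily shrinks the KL term.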

VBEM vs. EM

• Only different in the E-step

• In standard EM, q(X) is directly set to p(X|D,θ), making KL = 0
• In VBEM, it is intractable to compute p(X|D,θ); instead, it approximates p(X|D,θ) by a variational distribution q(X), by minimizing KL(q(X) ‖ p(X|D,θ))

• This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data

• Strategy (variational EM):

• Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
• Repeat until convergence:
– E: maximize L with respect to the variational parameters γ, φ
– M: maximize the bound with respect to parameters α and β

Parameter Estimation

• E-step: variational inference – repeat until convergence

• M-step: parameter estimation
– β: updated in closed form from the variational posteriors
– α: can be implemented using the Newton-Raphson method
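As an illustration of that Newton-Raphson step, here is a simplified sketch for a symmetric Dirichlet (a single scalar α rather than the full vector update in Blei et al.; the function name and toy data are mine):

```python
import numpy as np
from scipy.special import digamma, polygamma

def fit_symmetric_alpha(elog_theta, iters=20):
    """Newton-Raphson for a symmetric Dirichlet parameter alpha.

    elog_theta: array (M, K) of E[log theta_dk] from the E-step.
    """
    M, K = elog_theta.shape
    s = elog_theta.sum()          # sufficient statistic
    alpha = 1.0
    for _ in range(iters):
        # Gradient and Hessian of the Dirichlet log likelihood in alpha.
        grad = M * K * (digamma(K * alpha) - digamma(alpha)) + s
        hess = M * (K * K * polygamma(1, K * alpha) - K * polygamma(1, alpha))
        alpha = max(alpha - grad / hess, 1e-6)   # Newton step, keep alpha > 0
    return alpha

# Toy check: recover a known alpha from samples drawn with it.
rng = np.random.default_rng(3)
theta = rng.dirichlet(np.full(5, 0.7), size=2000)
print(fit_symmetric_alpha(np.log(theta)))   # close to 0.7
```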

Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset – contains 8,000 documents and 15,818 words

(a) EARN vs. NOT EARN  (b) GRAIN vs. NOT GRAIN

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong

[Figure: the same LDA graphical model as above, repeated for several documents]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information

[Figure: plate model with mixture proportions π generating topics z, which generate patches x, with parameters θ and β; N_d patches per image, M images]

Codebook
• 174 local image patches

• Detection:
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe's DoG detector

• Representation:
– Normalized 11×11 gray values
– 128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006
• Correlated topic model, NIPS 2005
• Hierarchical Dirichlet process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – maximum margin discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.

• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA

• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus

• Dimension reduction:

       Term1  Term2  Term3  Term4  …  TermN
Img1   1      2      0      0      …  2          (Dim = 1 million)

→ topic projection →

       f1     f2     …  fM
Img1   0.2    0.1    …  0.03                     (Dim = 200)

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ξ: residual error

p ≈ Xw + ξ

An image = [a sparse histogram over the vocabulary] + a few words (10 words)
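A minimal numpy sketch of this idea (the dimensions, the random orthonormal X, and the toy vector are all invented): compute the low-dimensional code, then keep only the ~10 largest residual coordinates as the "few words".

```python
import numpy as np

rng = np.random.default_rng(4)
V, k = 1000, 20                     # toy vocabulary size and reduced dimension
X, _ = np.linalg.qr(rng.standard_normal((V, k)))   # orthonormal basis columns

p = np.zeros(V)                     # a toy sparse TF-IDF vector (30 terms)
p[rng.choice(V, 30, replace=False)] = rng.random(30)

w = X.T @ p                         # low-dimensional representation
resid = p - X @ w                   # residual error in vocabulary space
top = np.argsort(-np.abs(resid))[:10]   # keep only ~10 "residual words"
resid_sparse = np.zeros(V)
resid_sparse[top] = resid[top]

err_proj = np.linalg.norm(p - X @ w)                  # projection alone
err_both = np.linalg.norm(p - X @ w - resid_sparse)   # projection + few words
print(err_proj, err_both)   # the sparse residual recovers much of the loss
```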

Orthogonal Decomposition

p = Xw + ξ,  with X = [x_1, x_2, …, x_k] the base vectors, w = (w_1, …, w_k)^T the low-dimensional representation, and ξ the residual

An image = [a sparse histogram over the vocabulary] + a few words (10 words)

Similarity is preserved under the decomposition:

p^T q ≈ w_p^T w_q + ξ_p^T ξ_q,  with orthonormal bases X_1, X_2, X_3, …, X_k
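Because the columns of X are orthonormal, X^T ξ = 0 for any residual ξ = p − XX^T p, so the cross terms vanish and the inner product splits exactly. A quick self-contained check (toy sizes again):

```python
import numpy as np

rng = np.random.default_rng(5)
V, k = 1000, 20
X, _ = np.linalg.qr(rng.standard_normal((V, k)))   # orthonormal columns

p, q = rng.random(V), rng.random(V)
wp, wq = X.T @ p, X.T @ q          # low-dimensional codes
xp, xq = p - X @ wp, q - X @ wq    # exact (un-sparsified) residuals

print(p @ q, wp @ wq + xp @ xq)    # the two numbers agree
```

With the sparsified residual of the previous sketch the equality becomes approximate, which is exactly the trade-off the indexing scheme exploits.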

A Probabilistic Implementation

x is a switch variable; it controls whether a word is generated from:

• a topic-specific distribution
• a document-specific distribution
• a background distribution

p(w | d) = p(x=0 | d) Σ_{k=1}^{K} p(w | z=k) p(z=k | d) + p(x=1 | d) p_d(w) + p(x=2 | d) p_B(w)

where p_d is the document-specific and p_B the background word distribution.

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
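A sketch of that three-way mixture for one document (every distribution below is invented; the switch prior p(x|d) just has to sum to one):

```python
import numpy as np

rng = np.random.default_rng(6)
K, V = 3, 12
p_w_z = rng.dirichlet(np.ones(V), K)     # topic-specific word distributions
p_z_d = rng.dirichlet(np.ones(K))        # this document's topic mixture
p_w_doc = rng.dirichlet(np.ones(V))      # document-specific distribution
p_w_bg = np.full(V, 1.0 / V)             # background distribution
p_x = np.array([0.6, 0.3, 0.1])          # p(x=0|d), p(x=1|d), p(x=2|d)

# p(w|d) = p(x=0|d) sum_k p(w|z=k) p(z=k|d) + p(x=1|d) p_d(w) + p(x=2|d) p_B(w)
p_w_d = p_x[0] * (p_z_d @ p_w_z) + p_x[1] * p_w_doc + p_x[2] * p_w_bg
print(p_w_d.sum())   # 1.0: a valid distribution over the vocabulary
```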

Search (Online)

[Figure: the query's low-dimensional code is looked up in an LSH index over document signatures (DS1, DS2, …), returning a candidate list (Doc 300, Doc 401, …)]

A query = [a sparse histogram over the vocabulary] + a few words

[Figure: candidates (Doc 300, Doc 401, …) are fetched from the doc-meta store and re-ranked to produce the final list]

Index: 10M images, 46 GB. Search speed: < 100 ms

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics
• Structure: who replies to whom
• Optimize them together: one objective jointly models the semantics and the structure

Reply Reconstruction

• Combines document similarity, topic similarity, and structure similarity

• Baselines:
– NP: reply to nearest post
– RR: reply to root
– DS: document similarity
– LDA: Latent Dirichlet Allocation; project documents to topic space
– SWB: Special Words topic model with Background distribution; project documents to topic and junk-topic space

Evaluation

Method   Slashdot All Posts   Slashdot Good Posts   Apple All Posts   Apple Good Posts
NP       0.021                0.012                 0.289             0.239
RR       0.183                0.319                 0.269             0.474
DS       0.463                0.643                 0.409             0.628
LDA      0.465                0.644                 0.410             0.648
SWB      0.463                0.644                 0.410             0.641
SMSS     0.524                0.737                 0.517             0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding

• Methods: HITS, PageRank, …

Baselines

– LM (Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06): achieves stable performance in the expert finding task using a language model
– PageRank: benchmark nodal ranking method
– HITS: finds hub nodes and authority nodes
– EABIF (Personalized Recommendation Driven by Information Flow, SIGIR '06): finds the most influential nodes

Evaluation

• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition – a good practice for learning matrices
– Graphical models – a good practice for learning probability

• A graphical model is a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images

• A graphical model is more adaptable to various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 33: An Introduction To Matrix Decomposition and Graphical Model

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 34: An Introduction To Matrix Decomposition and Graphical Model

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

  p(w|d) = p(x=0|d) · Σ_{k=1..K} p(w|z=k) p(z=k|d)
         + p(x=1|d) · p(w|d, special)
         + p(x=2|d) · p(w|background)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
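A direct sketch of this three-way mixture; the argument names are assumptions made for illustration, following the decomposition above rather than any released code.

import numpy as np

def word_prob(w, p_switch, theta, beta, p_doc, p_bg):
    # p_switch: (3,) switch probabilities p(x=0|d), p(x=1|d), p(x=2|d)
    # theta: (K,) p(z|d); beta: (K, V) topic-word distributions
    # p_doc: (V,) document-specific distribution; p_bg: (V,) background
    return (p_switch[0] * (theta @ beta[:, w])   # topic-specific
          + p_switch[1] * p_doc[w]               # document-specific
          + p_switch[2] * p_bg[w])               # background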

Search (Online)

[Figure: offline index – each document's signature (DS1, DS2, …) is inserted into an LSH index, which maps a query to a short candidate list (Doc 300, Doc 401, …)]

A query = [bar chart: its low-dimensional topic feature vector] + a few words

Re-ranking: the LSH candidates (Doc 300, Doc 401, …) are fetched from the doc metadata store and re-ranked to produce the final result list (Doc 1, Doc 2, …, Doc N)

Index: 10M images, 46GB. Search speed: < 100ms
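A toy sketch of the index/lookup stage; random-hyperplane LSH is one standard choice and is assumed here for illustration – the paper's exact hashing scheme may differ.

import numpy as np

class RandomHyperplaneLSH:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = {}
    def _sig(self, w):
        # sign pattern of the projections = bucket key
        return tuple(int(b) for b in (self.planes @ w > 0))
    def add(self, doc_id, w):                    # offline indexing
        self.buckets.setdefault(self._sig(w), []).append(doc_id)
    def query(self, w):                          # online candidate lookup
        return self.buckets.get(self._sig(w), [])

# Online: candidates = index.query(w_query), then re-rank each candidate
# with the full similarity w_q @ w_d + r_q @ r_d before returning results.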

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & structure

• Semantic: topics
• Structure: who replies to whom
• Optimize them together: model the semantics and the structure jointly

Reply reconstruction

• Document Similarity
• Topic Similarity
• Structure Similarity
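One plausible way to combine the three similarities when picking each post's parent; the fixed weights and pairwise similarity functions are assumptions for illustration, whereas the SMSS model of the paper optimizes semantics and structure jointly.

def reconstruct_replies(n_posts, doc_sim, topic_sim, struct_sim, w=(1.0, 1.0, 1.0)):
    # Posts are indexed 0..n_posts-1 in thread order; post 0 is the root.
    # Each later post is attached to the earlier post with the highest
    # combined similarity score.
    parents = {0: None}
    for i in range(1, n_posts):
        scores = [w[0] * doc_sim(i, j) + w[1] * topic_sim(i, j)
                  + w[2] * struct_sim(i, j) for j in range(i)]
        parents[i] = max(range(i), key=lambda j: scores[j])
    return parents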

Baselines:
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation – project documents to topic space
• SWB: Special Words Topic Model with Background distribution – project documents to topic and junk-topic space

Evaluation

Method    Slashdot               Apple
          All Posts  Good Posts  All Posts  Good Posts
NP        0.021      0.012       0.289      0.239
RR        0.183      0.319       0.269      0.474
DS        0.463      0.643       0.409      0.628
LDA       0.465      0.644       0.410      0.648
SWB       0.463      0.644       0.410      0.641
SMSS      0.524      0.737       0.517      0.772

Expert finding

Pipeline: reply reconstruction → network construction → expert finding

Methods: HITS, PageRank, …
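A small sketch of the ranking step using networkx; edges point from the replier to the author being answered, so authority/PageRank scores favor users whose posts attract replies.

import networkx as nx

def rank_experts(reply_pairs):
    # reply_pairs: (replier, answered_author) edges from reply reconstruction
    G = nx.DiGraph()
    G.add_edges_from(reply_pairs)
    pagerank = nx.pagerank(G)
    hubs, authorities = nx.hits(G)   # answerers surface as authorities
    return sorted(authorities, key=authorities.get, reverse=True), pagerank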

Baselines:
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06 – achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06 – finds the most influential nodes

Evaluation

• Bayesian estimate

Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
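For reference, a standard sketch of the three metrics (MRR, average precision for MAP, and P@10); this is generic evaluation code, not code from the paper.

def mrr(rankings, relevant):
    # rankings: list of ranked id lists, one per query; relevant: list of sets
    return sum(next((1.0 / (i + 1) for i, x in enumerate(r) if x in rel), 0.0)
               for r, rel in zip(rankings, relevant)) / len(rankings)

def precision_at_k(ranking, rel, k=10):
    return sum(1 for x in ranking[:k] if x in rel) / k

def average_precision(ranking, rel):
    hits, score = 0, 0.0
    for i, x in enumerate(ranking):
        if x in rel:
            hits += 1
            score += hits / (i + 1)
    return score / max(len(rel), 1)   # MAP = mean of AP over queries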

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition – a good practice for learning matrices
  – Graphical models – a good practice for learning probability

• Graphical models are a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

Page 35: An Introduction To Matrix Decomposition and Graphical Model

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 36: An Introduction To Matrix Decomposition and Graphical Model

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 37: An Introduction To Matrix Decomposition and Graphical Model

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

• Given a set of parameter values, a probability density function (PDF) shows that some data are more probable than other data.

• Inversely, given the observed data and a model of interest, the likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data.

Maximum Likelihood (ML)

• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from the model.

• Suppose we are given n data samples (x1, x2, …, xn).

• Maximum likelihood finds the θ that maximizes L(θ): θML = argmaxθ L(θ).

• Predictive distribution: p(xnew|θML).

IID – Independent, Identically Distributed

• IID means p(x1, …, xn|θ) = Πi p(xi|θ).

• The problem is considerably simplified, as the likelihood factorizes: L(θ) = Πi p(xi|θ).

• Usually the log likelihood is used: log L(θ) = Σi log p(xi|θ).

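A small numeric sketch of these two slides, assuming i.i.d. samples from a univariate Gaussian (the data and the grid ranges below are made up): the log likelihood is a sum over samples, and its maximizer matches the closed-form ML estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # hypothetical i.i.d. data

def log_likelihood(mu, sigma):
    # i.i.d. assumption => log L(theta) = sum_i log p(x_i | theta)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# crude grid search over (mu, sigma), just to illustrate argmax_theta L(theta)
mus = np.linspace(0, 4, 201)
sigmas = np.linspace(0.5, 3, 251)
ll = np.array([[log_likelihood(m, s) for s in sigmas] for m in mus])
i, j = np.unravel_index(ll.argmax(), ll.shape)
print("grid ML:", mus[i], sigmas[j])
print("closed form:", x.mean(), x.std())        # the analytic ML estimates
```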
Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Introduction to Machine Learning, Lectures 1–2 slides)

• Gregor Heinrich. Parameter estimation for text analysis. Technical note, 2005–2008.

EXPECTATION MAXIMIZATION

Why We Need EM

• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models.

• Why do we need latent variables?

• To describe complex models, e.g., the Gaussian mixture model.

• To discover the intrinsic structure inside a data set, e.g., topic models such as pLSA and LDA.

More General

• Data set D; likelihood p(D|θ).

• Goal: learn maximum likelihood (ML) parameter values.

• The maximum likelihood procedure finds parameters θ such that θML = argmaxθ p(D|θ).

• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated and hard-to-optimize function.

The Expectation Maximization (EM) Algorithm

• The EM algorithm finds a (local) maximum of a latent variable model's likelihood. It starts from arbitrary values of the parameters and iterates two steps:

• E step: fill in values of the latent variables according to their posterior given the data.

• M step: maximize the likelihood as if the latent variables were not hidden.

• It decomposes difficult problems into a series of tractable steps.

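As a concrete instance of the two steps, here is a minimal EM sketch for a 1-D mixture of two Gaussians; the data, initialization, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

pi = np.array([0.5, 0.5]); mu = np.array([-1.0, 1.0]); sigma = np.array([1.0, 1.0])
for _ in range(50):
    # E step: responsibilities r[n, k] = posterior of component k for point n
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)
    # M step: maximize the likelihood as if the component labels were observed
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
print(pi, mu, sigma)   # should approach (0.3, 0.7), (-2, 3), (1, 1)
```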
Jensen's Inequality

Lower Bounding the Log Likelihood
• Observed data D = {yn}; latent variables X = {xn}; parameters θ.
• Goal: maximize the log likelihood (i.e., ML learning) w.r.t. θ.

• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:

log p(D|θ) = log ∫ p(X, D|θ) dX ≥ ∫ q(X) log (p(X, D|θ)/q(X)) dX = ⟨log p(X, D|θ)⟩q + H[q] = F(q, θ)

• where H[q] is the entropy of q(X).

The E and M Steps of EM

• The lower bound on the log likelihood is given by F(q, θ).

• EM alternates between:
• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed:

q(k)(X) = argmaxq F(q, θ(k−1))

• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed:

θ(k) = argmaxθ F(q(k)(X), θ)

The E Step

• E step, for fixed θ:

F(q, θ) = log p(D|θ) − KL(q(X) ‖ p(X|D, θ))

• The second term is the Kullback-Leibler divergence.
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) ‖ p(X|D, θ)) = 0.
• So the E step simply sets q(X) = p(X|D, θ).

The M Step

• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed:

θ(k) = argmaxθ ∫ q(X) log p(X, D|θ) dX

• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ.

• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically.

EM Never Decreases the Likelihood

• The E and M steps together never decrease the log likelihood:

L(θ(k−1)) = F(q(k), θ(k−1)) ≤ F(q(k), θ(k)) ≤ L(θ(k))

• The E step brings F(q, θ) up to the likelihood L(θ).
• The M step increases F(q, θ) by maximizing it w.r.t. θ.
• F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently from the non-negativity of the KL divergence.

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Unsupervised Learning, Lecture 5 slides)

• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models?

• Cons:
– A graphical model can become quite complex even with a few cycles…
– We have to make too many assumptions

• Pros:
– We do need probability to explain our world, but the joint probability is hard to compute
– Graphical models can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution:

p(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|B,C)p(E|C,D)

• In general:

p(X1, …, Xn) = Πi p(Xi | Xpa(i))

• where pa(i) are the parents of node i.

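A toy sketch of this factorization over five binary variables; the conditional tables below are made-up numbers, chosen only so that each conditional distribution sums to one.

```python
import itertools

# p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D) with hypothetical tables
pA = {0: 0.6, 1: 0.4}
pB = {0: 0.7, 1: 0.3}
pC = {(a, b): {0: 0.9 - 0.2*a - 0.3*b, 1: 0.1 + 0.2*a + 0.3*b} for a in (0, 1) for b in (0, 1)}
pD = {(b, c): {0: 0.5 + 0.3*b - 0.2*c, 1: 0.5 - 0.3*b + 0.2*c} for b in (0, 1) for c in (0, 1)}
pE = {(c, d): {0: 0.8 - 0.4*c, 1: 0.2 + 0.4*c} for c in (0, 1) for d in (0, 1)}

def joint(a, b, c, d, e):
    # the joint is a product of local conditionals, one per node
    return pA[a] * pB[b] * pC[(a, b)][c] * pD[(b, c)][d] * pE[(c, d)][e]

# sanity check: the factorized joint sums to 1 over all 2^5 assignments
total = sum(joint(*v) for v in itertools.product((0, 1), repeat=5))
print(round(total, 6))   # 1.0
```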
Directed Graphs for Statistical Models: Plate Notation

• A data set of N points generated from a Gaussian: the plate is shorthand for N repeated i.i.d. nodes.

PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

• For natural language queries, simple term matching does not work effectively:
– Terms are ambiguous
– The same queries vary due to personal styles

• Latent semantic indexing:
– Creates a 'latent semantic space' (hidden meaning)

• LSI puts documents together even if they don't have common words, as long as the docs share frequently co-occurring terms.

• Disadvantages:
– The statistical foundation is missing

pLSA – Probabilistic Latent Semantic Analysis

• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve:
– Polysemy
• "Java" could mean "coffee" and also the "programming language Java"
• "Cricket" is a "game" and also an "insect"
– Synonymy
• "computer", "PC", "desktop" all could mean the same

• Has a better statistical foundation than LSA

pLSA

(Figure: plate notation. For each of M documents d, each of its Nd words w is generated through a latent topic z; unrolled, a document d generates topics z1, z2, …, zN, and each zi generates a word wi.)

z1, …, zN are variables, zi ∈ [1, K]; K is the number of latent topics.

pLSA

(Figure: the model unrolled over the corpus. Each document d1, d2, …, dM generates its own topic variables and words.)

p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents.

Joint Probability vs. Likelihood

• Joint probability: p(d, w) = p(d) Σz p(w|z) p(z|d)

• Likelihood (only for the observed variables): L = Σd Σw n(d, w) log p(d, w)

• p(d) is assumed to be uniform.

Document Decomposition

• Each document can be decomposed as p(w|d) = Σz p(w|z) p(z|d).

• This is similar to matrix decomposition, if we consider each discrete distribution as a vector:

p(w|d) = ZV×K p(z|d)

• With many documents, we hope to find the latent topics as a common basis.

pLSA – Objective Function

• pLSA tries to maximize the log likelihood:

L = Σd Σw n(d, w) log Σz p(w|z) p(z|d)

• Due to the summation over z inside the log, we have to resort to EM.

EM Steps

• E-Step
– The expectation of the likelihood function is calculated with the current parameter values.
• M-Step
– Update the parameters with the calculated posterior probabilities.
– Find the parameters that maximize the likelihood function.

Lower Bounding the Log Likelihood

EM Steps

• The E-Step: compute the topic posterior for each document-word pair,

p(z|d, w) = p(w|z) p(z|d) / Σz′ p(w|z′) p(z′|d)

• The M-Step: re-estimate the distributions from the expected counts,

p(w|z) ∝ Σd n(d, w) p(z|d, w),  p(z|d) ∝ Σw n(d, w) p(z|d, w)

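Putting the two steps together, a compact sketch of pLSA EM on a toy term-document count matrix; the sizes, counts, and iteration budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, M, K = 50, 20, 3
N = rng.poisson(1.0, size=(V, M))                    # hypothetical counts n(d, w)

p_w_z = rng.random((V, K)); p_w_z /= p_w_z.sum(0)    # p(w|z)
p_z_d = rng.random((K, M)); p_z_d /= p_z_d.sum(0)    # p(z|d)

for _ in range(100):
    # E step: posterior p(z|d,w) proportional to p(w|z) p(z|d), per (w, d) pair
    post = p_w_z[:, :, None] * p_z_d[None, :, :]     # shape (V, K, M)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M step: re-estimate p(w|z) and p(z|d) from the expected counts
    nz = N[:, None, :] * post                        # expected counts (V, K, M)
    p_w_z = nz.sum(axis=2); p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = nz.sum(axis=0); p_z_d /= p_z_d.sum(axis=0, keepdims=True)
```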
Latent Subspace

pLSA vs. LSA
• LSA and pLSA perform dimensionality reduction:
– In LSA, by keeping only K singular values
– In pLSA, by having K aspects

• Comparison to SVD:
– The U matrix is related to p(z|d) (doc to aspect)
– The V matrix is related to p(w|z) (aspect to term)
– The Σ (diagonal) matrix is related to p(z) (aspect strength)

pLSA vs. LSA
• The main difference is the way the approximation is done.

• pLSA generates a model (the aspect model) and maximizes its predictive power.

• Selecting the proper value of K is heuristic in LSA.

• Model selection in statistics can determine the optimal K in pLSA.

Applications

• Text mining: topic discovery

• Scene classification

Text Mining

Scene Classification

Classification Result

Reference

• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.

• Bosch, A., Zisserman, A., and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).

• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A., and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level; each doc has its own topic mixture proportion.

• The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA

• There is no constraint on the distributions p(z|di).

• This easily leads to serious over-fitting.

(Figure: the unrolled pLSA model; every document di carries its own free parameters p(z|d1), p(z|d2), …, p(z|dM).)

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.

• Requirements for such a distribution:
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomial parameters.
– It should be easy to optimize.

• The Dirichlet distribution is one such distribution.

• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution

• Definition:

p(x1, x2, …, xK | α) = Γ(Σi αi) / (Πi Γ(αi)) · Πi xi^(αi − 1),  s.t. xi ≥ 0, Σi xi = 1

• The density is zero outside this open (K − 1)-dimensional simplex.

• Various parameters α:

α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)

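A quick sketch verifying the two requirements above with NumPy's Dirichlet sampler, using the α values just shown:

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in [(6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]:
    theta = rng.dirichlet(alpha, size=3)
    # each sample is a K-tuple of nonnegative numbers summing to 1,
    # i.e., a point on the (K-1)-simplex
    print(alpha, theta.sum(axis=1))
# larger alpha_0 = sum(alpha) concentrates the mass near the mean alpha/alpha_0
```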
Example Dirichlet Distributions (K=3)

(Figures: density plots on the 2-simplex for the α values above.)

Example Dirichlet Distributions (K=3)

• Equal αi, different α0 = Σi αi:

α0 = 0.1, α0 = 1, α0 = 10

The LDA Model

(Figure: LDA graphical model — per-document topic proportions θ drawn from a Dirichlet prior α; topics z1, …, zN generate words w1, …, wN via the topic-word parameters β.)

• For each document:
• Choose θ ~ Dirichlet(α)
• For each of the N words wn:
– Choose a topic zn ~ Multinomial(θ)
– Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn

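The generative process above, written as a toy sampler; K, V, N, and the hyperparameters are made-up values, and β is itself sampled here just so there are topic-word distributions to draw from.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 4, 30, 20
alpha = np.full(K, 0.5)                        # hypothetical hyperparameter
beta = rng.dirichlet(np.ones(V), size=K)       # beta[k] = p(w | z=k)

theta = rng.dirichlet(alpha)                   # choose theta ~ Dirichlet(alpha)
doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)                 # choose a topic z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])               # choose a word w_n from p(w | z_n, beta)
    doc.append(w)
print(doc)                                     # one sampled document
```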
Joint Probability

• Given the parameters α and β:

p(θ, z, w | α, β) = p(θ|α) Πn p(zn|θ) p(wn|zn, β)

• where p(zn|θ) is simply θi for the topic i selected by zn.

Likelihood

• Joint probability: p(θ, z, w | α, β), as above.

• Marginal distribution of a document:

p(w | α, β) = ∫ p(θ|α) (Πn Σzn p(zn|θ) p(wn|zn, β)) dθ

• Likelihood over all the documents: L = Πd p(wd | α, β).

Inference

• The likelihood can be computed by summing over each document.
• Jensen's inequality is used, as in EM.

Inference

• In the E-Step, we need to compute the posterior distribution of the hidden variables, p(θ, z | w, α, β).

• Unfortunately, this distribution is intractable to compute in general.

• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions.

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence.

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs. EM

• They differ only in the E-Step.

• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, p(X|D, θ) is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) ‖ p(X|D, θ)).

• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.

• Strategy (variational EM):

• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence:
– E: maximize L with respect to the variational parameters γ and φ
– M: maximize the bound with respect to the parameters α and β

Parameter Estimation

• E-Step (variational inference) — repeat until convergence:

φni ∝ βi,wn exp(Ψ(γi)),  γi = αi + Σn φni

• M-Step (parameter estimation):

β: βij ∝ Σd Σn φdni wdnj

α: can be updated using the Newton-Raphson method

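The slides do not show the Newton-Raphson update itself; as a reminder of the scheme (which Blei et al. apply to the gradient of the bound w.r.t. α), here is a generic sketch solving f(x) = 0:

```python
def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    # Newton-Raphson: repeatedly move by f(x)/f'(x) until the step is tiny
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# example: root of f(x) = x^2 - 2, i.e., sqrt(2)
print(newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0))
```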
Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset – contains 8,000 documents and 15,818 words

(a) EARN vs. NOT EARN  (b) GRAIN vs. NOT GRAIN

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

(Figure: the LDA graphical model, as above.)

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information

(Figure: plate notation over M documents and Nd patches, with variables π, z, x, θ, and β.)

Codebook

• 174 local image patches

• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector

• Representation: normalized 11×11 gray values, 128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006
• Correlated topic model, NIPS 2005
• Hierarchical Dirichlet process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – maximum margin discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, sparse coding, etc.

• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA

• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
– We need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus

• Dimension reduction:

       Term1  Term2  Term3  Term4  …  TermN
Img1   1      2      0      0      …  2

→ topic projection →

       f1   f2   …  fM
Img1   0.2  0.1  …  0.03

Dim = 1 million → Dim = 200

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ξ: residual error

p ≈ Xw + ξ

(Figure: an image = a low-dimensional feature vector + a few words (10 words).)

Orthogonal Decomposition

• With orthonormal base vectors X = [x1, x2, …, xk]:

w = XTp = (x1Tp, x2Tp, …, xkTp)T,  ξ = p − Xw

– X: base vectors
– w: low-dimensional representation
– ξ: residual

(Figure: an image = a low-dimensional feature vector + a few words (10 words).)

• Inner products are preserved: pTq = wpTwq + ξpTξq, since the residuals are orthogonal to span(X).

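A small sketch of this decomposition, assuming an orthonormal basis X learned offline (here just a random orthonormal matrix) and keeping the 10 largest residual entries as the "few words":

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 20
X, _ = np.linalg.qr(rng.normal(size=(V, k)))   # orthonormal columns as the basis

def encode(p, n_words=10):
    w = X.T @ p                # low-dimensional representation
    resid = p - X @ w          # residual error in vocabulary space
    keep = np.argsort(-np.abs(resid))[:n_words]
    sparse = np.zeros_like(resid); sparse[keep] = resid[keep]
    return w, sparse

p, q = rng.normal(size=V), rng.normal(size=V)
wp, rp = encode(p); wq, rq = encode(q)
# p.q equals wp.wq + full-residual inner product exactly; truncating the
# residual to a few entries makes the right-hand side an approximation
print(p @ q, wp @ wq + rp @ rq)
```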
A Probabilistic Implementation

x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution
• a document-specific distribution
• a background distribution

p(w|d) = p(x=0|d) Σk p(w|z=k) p(z=k|d) + p(x=1|d) p(w|d, special) + p(x=2|d) p(w|background)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.

Search (Online)

(Figure: online pipeline. A query = a low-dimensional feature vector + a few words. The LSH index over document signatures (DS) returns a candidate set, e.g., Doc 300, Doc 401, …, which is then re-ranked against the doc metadata to produce the final list.)

Index: 10M images, 46 GB. Search speed: < 100 ms.

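A minimal sketch of the online stage, under the assumption of random-hyperplane LSH over the low-dimensional vectors; the slides do not specify the hash family, so the bucket layout, sizes, and re-ranking score below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_docs, n_bits = 200, 10_000, 16
H = rng.normal(size=(n_bits, k))               # random hyperplanes

def code(w):
    return tuple((H @ w > 0).astype(int))      # sign pattern = bucket key

docs = rng.normal(size=(n_docs, k))            # hypothetical document vectors
buckets = {}
for i, w in enumerate(docs):
    buckets.setdefault(code(w), []).append(i)

q = docs[42] + 0.01 * rng.normal(size=k)       # a query close to doc 42
cand = buckets.get(code(q), [])                # candidate set from one bucket
# re-rank the small candidate set by full similarity (residuals would refine this)
best = max(cand, key=lambda i: docs[i] @ q) if cand else None
print(len(cand), best)
```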
Search Example

(Figure: a query image and its retrieved results.)

Search Example

(Figure: another query image and its retrieved results.)

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics
• Structure: who replies to whom

Optimize Them Together

• Model the semantics
• Model the structure

Reply Reconstruction

• Document similarity
• Topic similarity
• Structure similarity

Baselines

• NP: reply to the nearest post
• RR: reply to the root
• DS: document similarity
• LDA: Latent Dirichlet Allocation; projects documents to topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk-topic space

Evaluation

Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding

• Methods: HITS, PageRank, …

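To ground the network-based step, a tiny HITS power-iteration sketch on a hypothetical reply graph; the adjacency values are made up, whereas in the paper's setting an edge i → j would come from the reconstructed replies.

```python
import numpy as np

# edge i -> j means user i replied to user j (j gains authority as an answerer)
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)   # made-up adjacency matrix

h = np.ones(4)                              # initial hub scores
for _ in range(100):
    a = A.T @ h; a /= np.linalg.norm(a)     # authority update
    h = A @ a;  h /= np.linalg.norm(h)      # hub update
print("authority:", np.round(a, 3), "hub:", np.round(h, 3))
```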
Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance in the expert-finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential nodes

Evaluation

• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision:
– Matrix decomposition – a good practice for learning matrices
– Graphical models – a good practice for learning probability

• A graphical model is a good tool to analyze problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 38: An Introduction To Matrix Decomposition and Graphical Model

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 39: An Introduction To Matrix Decomposition and Graphical Model

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

• Joint probability:
  p(θ, z, w | α, β) = p(θ | α) Π_{n=1}^N p(z_n | θ) p(w_n | z_n, β)

• Marginal distribution of a document:
  p(w | α, β) = ∫ p(θ | α) [Π_{n=1}^N Σ_{z_n} p(z_n | θ) p(w_n | z_n, β)] dθ

• Likelihood over all the documents:
  p(D | α, β) = Π_{d=1}^M p(w_d | α, β)

Inference

• The likelihood can be computed by summing over each document.
• Jensen's inequality is used in EM.

Inference

• In the E-Step, we need to compute the posterior distribution of the hidden variables:

  p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

• Unfortunately, this distribution is intractable to compute in general.

• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions.

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence.

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs EM

• They differ only in the E-Step.

• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, p(X|D, θ) is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) ‖ p(X|D, θ)).

• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.

• Strategy (variational EM):
  – Lower bound log p(w | α, β) by a function L(γ, φ; α, β).
  – Repeat until convergence:
    » E: Maximize L with respect to the variational parameters γ and φ.
    » M: Maximize the bound with respect to the model parameters α and β.

Parameter Estimation

• E-Step (variational inference): repeat the γ and φ updates until convergence.

• M-Step (parameter estimation):
  – β is re-estimated in closed form from the expected topic assignments.
  – The α update can be implemented using the Newton–Raphson method.

A sketch of the per-document E-Step updates follows.
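For reference, the per-document E-Step has a simple closed form in Blei et al. (2003): iterate φ_{ni} ∝ β_{i,w_n} exp(Ψ(γ_i)) and γ = α + Σ_n φ_n until convergence, where Ψ is the digamma function. A minimal sketch, with variable names ours:

```python
import numpy as np
from scipy.special import digamma

def variational_e_step(word_ids, alpha, beta, iters=50):
    """Variational inference for one document (Blei et al., 2003).
    word_ids: length-N array of word indices; alpha: (K,); beta: (K, V)."""
    K, N = len(alpha), len(word_ids)
    phi = np.full((N, K), 1.0 / K)      # q(z_n) multinomial parameters
    gamma = alpha + N / K               # q(theta) Dirichlet parameters
    for _ in range(iters):
        phi = beta[:, word_ids].T * np.exp(digamma(gamma))  # phi_ni ∝ beta_{i,w_n} e^{Ψ(γ_i)}
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)                     # gamma_i = alpha_i + Σ_n phi_ni
    return gamma, phi
```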

Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus.

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset – contains 8,000 documents and 15,818 words.

(a) EARN vs NOT EARN   (b) GRAIN vs NOT GRAIN

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

(Graphical model: the same LDA structure as above, with topics z_1 … z_4 generating words w_1 … w_4 and shared β.)

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information.

(Plate notation: for each of M images, a mixture π, topic assignments z, and observed patches x over N_d patches, with parameters θ and β.)

Codebook

• 174 local image patches.

• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector.

• Representation: normalized 11×11 gray values, 128-dim SIFT.

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic Topic Models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes Pachinko Allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA

• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – Need to access 1000 inverted lists.
  – The intersection of 1000 inverted lists may be empty.
  – The union of 1000 inverted lists may be the whole corpus.

• Dimension reduction:

  Vocabulary space (dim = 1 million):
         Term1  Term2  Term3  Term4  …  TermN
  Img1:  1      2      0      0      …  2

        → Topic Projection →

  Topic space (dim = 200):
         f1    f2    …  fM
  Img1:  0.2   0.1   …  0.03

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

  p ≈ Xw + ε

(Figure: an image is represented as a low-dimensional topic vector, shown as a bar chart, plus a few residual words (~10 words).)

Orthogonal Decomposition

  p = Xw + ε = [x_1, x_2, …, x_k] w + ε

• x_1 … x_k: base vectors (the columns of X)
• w: low-dimensional representation
• ε = p − Xw: residual


• Similarity between two documents p and q decomposes accordingly:

  pᵀq ≈ w_pᵀ w_q + ε_pᵀ ε_q

  (assuming the base vectors X_1, X_2, …, X_k are orthonormal, so the cross terms vanish)
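A small numpy sketch of this decomposition, assuming the basis X is orthonormal (so w = Xᵀp and Xᵀε = 0); the final assertion checks that pᵀq splits exactly into the low-dimensional term plus the residual term:

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 20
X, _ = np.linalg.qr(rng.standard_normal((V, k)))   # orthonormal base vectors x_1..x_k
p, q = rng.random(V), rng.random(V)                # two TF-IDF-like vectors

w_p, w_q = X.T @ p, X.T @ q                        # low-dimensional representations
e_p, e_q = p - X @ w_p, q - X @ w_q                # residuals (orthogonal to X's columns)
assert np.isclose(p @ q, w_p @ w_q + e_p @ e_q)    # cross terms vanish
```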

A Probabilistic Implementation

• x is a switch variable: it controls whether a word is generated from
  – a topic-specific distribution,
  – a document-specific distribution, or
  – a background distribution:

  p(w | d) = p(x=0 | d) Σ_{k=1}^K p(w | z=k) p(z=k | d)
           + p(x=1 | d) p(w | d, specific)
           + p(x=2 | d) p(w | background)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
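A sketch of how the three-way mixture assembles p(w|d). The array names and shapes below are our illustrative assumptions about how the distributions might be stored, following the SWB formulation:

```python
import numpy as np

def word_prob(w, d, p_x, p_w_z, p_z_d, p_w_doc, p_w_bg):
    """p(w|d) under the switch-variable model.
    p_x: (D, 3) switch probabilities p(x|d); p_w_z: (K, V) topic-word dists;
    p_z_d: (D, K) doc-topic dists; p_w_doc: (D, V) document-specific dists;
    p_w_bg: (V,) background dist."""
    topic_term = p_w_z[:, w] @ p_z_d[d]     # Σ_k p(w|z=k) p(z=k|d)
    return (p_x[d, 0] * topic_term          # topic-specific route
            + p_x[d, 1] * p_w_doc[d, w]     # document-specific route
            + p_x[d, 2] * p_w_bg[w])        # background route
```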

Search (Online)

(System diagram: document signatures DS1, DS2, … are stored in buckets of an LSH index. A query is reduced to its low-dimensional vector plus a few residual words, hashed to fetch a candidate bucket (e.g., Doc 300, Doc 401, …), and the candidates are then re-ranked against the corpus docs Doc 1 … Doc N using the residuals and doc metadata.)

Index: 10M images, 46GB. Search speed: < 100ms.
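The slides do not spell out the hashing scheme; a common choice for dense low-dimensional vectors like w is random-hyperplane LSH, sketched below as an assumption. Nearby vectors tend to share a signature, so the signature serves as a bucket key, with exact re-ranking done on the retrieved bucket.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_bits = 200, 32
H = rng.standard_normal((n_bits, k))    # random hyperplanes, fixed at index time

def signature(w):
    """32-bit LSH signature of a low-dimensional vector w."""
    bits = (H @ w) > 0                  # which side of each hyperplane
    return np.packbits(bits).tobytes()  # compact bucket key

# Offline: store each doc id under signature(w_doc).
# Online: hash the query, fetch its bucket, then re-rank candidates with
# the full similarity w_p·w_q plus the residual-word term.
```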

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics.
• Structure: who replies to whom.

Optimize them together: model the semantics and the structure jointly.

Reply Reconstruction

• Candidate replies are scored by combining document similarity, topic similarity, and structure similarity (see the sketch below).

Baselines:
• NP – Reply to Nearest Post
• RR – Reply to Root
• DS – Document Similarity
• LDA – Latent Dirichlet Allocation: project documents to topic space
• SWB – Special Words Topic Model with Background distribution: project documents to topic and junk-topic space
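One simple way to turn the three similarity signals into a reply-candidate score is a weighted combination, as sketched below; the weights and the linear form are purely illustrative, not the paper's SMSS formulation.

```python
def reply_score(doc_sim, topic_sim, struct_sim, weights=(0.4, 0.4, 0.2)):
    """Score a (candidate parent, post) pair from document, topic, and
    structure similarity; the weights here are illustrative guesses."""
    return (weights[0] * doc_sim
            + weights[1] * topic_sim
            + weights[2] * struct_sim)

# Reply reconstruction: predict each post's parent as the earlier post
# in the thread with the highest reply_score.
```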

Evaluation

Method | Slashdot (All Posts) | Slashdot (Good Posts) | Apple (All Posts) | Apple (Good Posts)
NP     | 0.021 | 0.012 | 0.289 | 0.239
RR     | 0.183 | 0.319 | 0.269 | 0.474
DS     | 0.463 | 0.643 | 0.409 | 0.628
LDA    | 0.465 | 0.644 | 0.410 | 0.648
SWB    | 0.463 | 0.644 | 0.410 | 0.641
SMSS   | 0.524 | 0.737 | 0.517 | 0.772

Expert Finding

Pipeline: reply reconstruction → network construction → expert finding.

Methods: HITS, PageRank, … (a HITS sketch follows below)
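For instance, HITS on the reconstructed reply network is a few lines of power iteration; a standard sketch (not code from the paper), where A[i, j] = 1 if user i replies to user j:

```python
import numpy as np

def hits(A, iters=100):
    """HITS on adjacency matrix A (A[i, j] = 1 if user i replies to user j)."""
    hubs = np.ones(A.shape[0])
    for _ in range(iters):
        auths = A.T @ hubs              # authority: receives replies from good hubs
        auths /= np.linalg.norm(auths)
        hubs = A @ auths                # hub: replies to good authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, auths
```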

Baselines

• LM – Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06: achieves stable performance in the expert-finding task using a language model.
• PageRank – benchmark nodal ranking method.
• HITS – finds hub nodes and authority nodes.
• EABIF – Personalized Recommendation Driven by Information Flow, SIGIR '06: finds the most influential nodes.

Evaluation

• Bayesian estimate

Method        | MRR   | MAP   | P@10
LM            | 0.821 | 0.698 | 0.800
EABIF(ori)    | 0.674 | 0.362 | 0.243
EABIF(rec)    | 0.742 | 0.318 | 0.281
PageRank(ori) | 0.675 | 0.377 | 0.263
PageRank(rec) | 0.743 | 0.321 | 0.266
HITS(ori)     | 0.906 | 0.832 | 0.900
HITS(rec)     | 0.938 | 0.822 | 0.906

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision:
  – Matrix decomposition: a good practice for learning matrices.
  – Graphical models: a good practice for learning probability.

• A graphical model is a good tool for analyzing problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 40: An Introduction To Matrix Decomposition and Graphical Model

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 41: An Introduction To Matrix Decomposition and Graphical Model

Outlinebull Basic concepts

ndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation

bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications

bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information

bull Summary

Not Included

bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130


Not Included

• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning?

Data
• Let x = (x_1, x_2, …, x_D)^T denote a data point, and D = {x^(1), x^(2), …, x^(N)} a data set. D is sometimes associated with desired outputs y_1, y_2, ….

Predictions
• We are generally interested in predicting something based on the observed data set.
• Given D, what can we say about x^(N+1)?

Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ.
• Given data D, we learn the model parameters, from which we can predict new data points.
• The model can often be expressed as a probability distribution over data points.

Likelihood Function

• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data.
• Inversely, given the observed data and a model of interest, the likelihood function is defined as

  L(θ) = f_θ(x) = p(x | θ)

• That is, the likelihood function L(θ) shows that some parameter values are more likely to have produced the data.

Maximum Likelihood (ML)

• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model.
• Suppose we are given n data samples (x_1, x_2, …, x_n); the likelihood is L(θ) = p(x_1, …, x_n | θ).
• Maximum likelihood finds the θ that maximizes L(θ): θ̂ = argmax_θ L(θ).
• Predictive distribution: predict new data with p(x | θ̂).

IID – Independent, Identically Distributed

• IID means p(x_1, …, x_n | θ) = ∏_{i=1}^n p(x_i | θ).
• The problem is considerably simplified: the joint likelihood factorizes into per-sample terms.
• Usually the log-likelihood is used: ℓ(θ) = log L(θ) = Σ_{i=1}^n log p(x_i | θ).
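To make these definitions concrete, here is a minimal sketch (assuming NumPy; the synthetic data and the candidate grid are invented for illustration) that evaluates an IID Gaussian log-likelihood and recovers the ML estimate of the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)  # IID samples, true mean = 2.0

def gaussian_log_likelihood(data, mu, sigma=1.0):
    # IID assumption: the joint log-likelihood is the sum of per-sample log densities
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

# Scan candidate means; the ML estimate maximizes the log-likelihood
candidates = np.linspace(0, 4, 401)
lls = [gaussian_log_likelihood(data, mu) for mu in candidates]
mu_ml = candidates[int(np.argmax(lls))]
print(mu_ml, data.mean())  # both close to 2.0 (the analytic MLE is the sample mean)
```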

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Introduction to Machine Learning, Lectures 1–2 slides)
• Gregor Heinrich. Parameter Estimation for Text Analysis. Technical note, 2005–2008.

EXPECTATION MAXIMIZATION

Why We Need EM

• The Expectation–Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models.
• Why do we need latent variables?
  – To describe complex models, e.g., the Gaussian Mixture Model.
  – To discover the intrinsic structure inside a data set, e.g., topic models such as pLSA and LDA.

More General

• Data set: D = {y_1, …, y_N}, with latent variables X.
• Likelihood: p(D | θ) = ∫ p(D, X | θ) dX.
• Goal: learn the maximum likelihood (ML) parameter values.
• The maximum likelihood procedure finds parameters θ such that θ̂ = argmax_θ p(D | θ).
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated, hard-to-optimize function.

The Expectation Maximization (EM) Algorithm

• The EM algorithm finds a (local) maximum of a latent variable model's likelihood. It starts from arbitrary values of the parameters and iterates two steps:
  – E step: fill in values of the latent variables according to their posterior given the data.
  – M step: maximize the likelihood as if the latent variables were not hidden.
• This decomposes difficult problems into a series of tractable steps.

Jensen's Inequality

• For a concave function f (such as log): f(E[x]) ≥ E[f(x)].

Lower Bounding the Log Likelihood

• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ.
• Goal: maximize the log likelihood (i.e., ML learning) w.r.t. θ:
  L(θ) = log p(D | θ) = log ∫ p(D, X | θ) dX.
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
  L(θ) = log ∫ q(X) [p(D, X | θ) / q(X)] dX ≥ ∫ q(X) log [p(D, X | θ) / q(X)] dX = F(q, θ),
• so F(q, θ) = ⟨log p(D, X | θ)⟩_{q(X)} + H[q], where H[q] is the entropy of q(X).

The E and M Steps of EM

• The lower bound on the log likelihood is F(q, θ) = ⟨log p(D, X | θ)⟩_{q(X)} + H[q].
• EM alternates between:
  – E step: optimize F w.r.t. the distribution over hidden variables, holding parameters fixed:
    q^(k)(X) = argmax_{q(X)} F(q(X), θ^(k−1));
  – M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed:
    θ^(k) = argmax_θ F(q^(k)(X), θ).

The E Step

• E step, for fixed θ:
  F(q, θ) = L(θ) − KL(q(X) ‖ p(X | D, θ)).
• The second term is the Kullback–Leibler divergence.
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when KL(q ‖ p) = 0, i.e., q(X) = p(X | D, θ).
• So the E step simply sets q^(k)(X) = p(X | D, θ^(k−1)).

The M Step

• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed:
  θ^(k) = argmax_θ F(q^(k), θ) = argmax_θ ⟨log p(D, X | θ)⟩_{q^(k)(X)}.
• The second equality comes from the fact that the entropy of q(X) does not depend on θ.
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically.

EM Never Decreases the Likelihood

• The E and M steps together never decrease the log likelihood:
  L(θ^(k−1)) = F(q^(k), θ^(k−1)) ≤ F(q^(k), θ^(k)) ≤ L(θ^(k)).
• The E step brings F(q, θ) up to the likelihood L(θ).
• The M step then increases F(q, θ) with respect to θ.
• F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently from the non-negativity of KL.
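As a concrete instance of these two steps, the following is a minimal EM sketch for a two-component 1-D Gaussian mixture (an assumed toy setup with unit variances and synthetic data, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 1-D data from two Gaussians (the latent variable is which component)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

# Parameters: mixing weight pi and component means mu (variances fixed at 1 for brevity)
pi, mu = 0.5, np.array([-1.0, 1.0])

def log_gauss(x, m):  # log N(x | m, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - m) ** 2

for _ in range(50):
    # E step: posterior responsibility of each component for each point
    log_r = np.stack([np.log(1 - pi) + log_gauss(data, mu[0]),
                      np.log(pi)     + log_gauss(data, mu[1])])
    r = np.exp(log_r - log_r.max(axis=0))
    r /= r.sum(axis=0)
    # M step: maximize the expected complete-data log likelihood
    pi = r[1].mean()
    mu = (r @ data) / r.sum(axis=1)

print(pi, mu)  # pi approaches 0.7, means approach (-2, 3)
```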

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Unsupervised Learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.

WHY DO WE NEED GRAPHICAL MODELS

Why Do We Need Graphical Models?

• Cons:
  – Graphical models can become complex, even with a few cycles.
  – We have to make many assumptions.
• Pros:
  – We do need probability to explain our world, but joint probabilities are hard to compute.
  – Graphical models help us analyze and understand our problems.
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables.
  – With a graphical model, we can decouple a joint probability into conditional probabilities, which are usually easier to work with.

Directed Acyclic Graphical Models (Bayesian Networks)

• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution:
  p(A, B, C, D, E) = p(A) p(B) p(C | A, B) p(D | B, C) p(E | C, D)
• In general:
  p(x_1, …, x_N) = ∏_{i=1}^N p(x_i | x_{pa(i)}),
• where pa(i) are the parents of node i.
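A small sketch of this factorization (the conditional probability tables below are hypothetical, chosen only to illustrate the product form):

```python
# Hypothetical CPTs for binary variables A..E, matching the factorization
# p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D).
pA = {1: 0.3, 0: 0.7}
pB = {1: 0.6, 0: 0.4}
pC = {(a, b): {1: 0.9 if a and b else 0.2} for a in (0, 1) for b in (0, 1)}
pD = {(b, c): {1: 0.5 if b else 0.1} for b in (0, 1) for c in (0, 1)}
pE = {(c, d): {1: 0.8 if c or d else 0.3} for c in (0, 1) for d in (0, 1)}
for table in (pC, pD, pE):           # fill complementary probabilities
    for key in table:
        table[key][0] = 1 - table[key][1]

def joint(a, b, c, d, e):
    # product of local conditionals, one factor per node given its parents
    return pA[a] * pB[b] * pC[(a, b)][c] * pD[(b, c)][d] * pE[(c, d)][e]

# Sanity check: the factorized joint sums to 1 over all 2^5 assignments
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(total)  # 1.0
```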

Directed Graphs for Statistical Models: Plate Notation

• Example: a data set of N points generated from a Gaussian. The plate (a box labeled N) replicates the observed node x_n for n = 1, …, N, with the shared Gaussian parameters drawn outside the plate.

PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI): Review

• For natural-language queries, simple term matching does not work effectively:
  – terms are ambiguous;
  – the same query varies with personal style.
• Latent semantic indexing creates a "latent semantic space" (hidden meaning).
• LSI puts documents together even if they don't have common words, as long as the documents share frequently co-occurring terms.
• Disadvantage: the statistical foundation is missing.

pLSA – Probabilistic Latent Semantic Analysis

• Automated document indexing and information retrieval.
• Identification of latent classes using an Expectation Maximization (EM) algorithm.
• Shown to solve:
  – Polysemy: "Java" could mean coffee and also the programming language Java; "cricket" is a game and also an insect.
  – Synonymy: "computer", "PC", and "desktop" could all mean the same thing.
• Has a better statistical foundation than LSA.

pLSA

(Plate diagram: document d → latent topic z → word w, with N_d words per document and M documents. Unrolled: d → (z_1, w_1), (z_2, w_2), …, (z_N, w_N).)

z_1, …, z_N are latent variables, z_i ∈ [1, K], where K is the number of latent topics.

pLSA

(Unrolled diagram: each document d_1, …, d_M generates its own words w_1, …, w_{N_m}, each word with its own latent topic z_i.)

The topic–word distributions p(w | z = k), k = 1, …, K, are shared across all documents.

Joint Probability vs. Likelihood

• Joint probability: p(d, w) = p(d) Σ_z p(z | d) p(w | z).
• Likelihood (only over the observed variables): L = Σ_d Σ_w n(d, w) log p(d, w), where n(d, w) is the count of word w in document d.
• p(d) is assumed to be uniform.

Document Decomposition

• Each document can be decomposed over topics:
  p(w | d) = Σ_z p(w | z) p(z | d).
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
  p(w | d) = Z_{V×K} · p(z | d),
  where the columns of Z_{V×K} are the topic–word distributions p(w | z).
• With many documents, we hope to find the latent topics as a common basis.

pLSA – Objective Function

• pLSA tries to maximize the log likelihood:
  ℓ = Σ_d Σ_w n(d, w) log Σ_z p(w | z) p(z | d).
• Due to the summation over z inside the log, we have to resort to EM.

EM Steps

• E step: the expectation of the (complete-data) likelihood function is calculated with the current parameter values.
• M step: update the parameters with the calculated posterior probabilities, i.e., find the parameters that maximize the likelihood function.

Lower Bounding the Log Likelihood

EM Steps

• The E step computes the topic posteriors:
  p(z | d, w) = p(w | z) p(z | d) / Σ_{z′} p(w | z′) p(z′ | d).
• The M step re-estimates the parameters from expected counts:
  p(w | z) ∝ Σ_d n(d, w) p(z | d, w),
  p(z | d) ∝ Σ_w n(d, w) p(z | d, w).
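A minimal NumPy sketch of these two updates on a toy term–document count matrix (dimensions and random initialization are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
V, M, K = 50, 20, 3                      # vocabulary, documents, topics
n = rng.integers(0, 5, size=(V, M))      # toy term-document counts n(w, d)

p_w_z = rng.random((V, K)); p_w_z /= p_w_z.sum(axis=0)   # p(w|z)
p_z_d = rng.random((K, M)); p_z_d /= p_z_d.sum(axis=0)   # p(z|d)

for _ in range(100):
    # E step: posterior p(z|d,w) for every (w, d) pair, shape (V, M, K)
    post = p_w_z[:, None, :] * p_z_d.T[None, :, :]
    post /= post.sum(axis=2, keepdims=True) + 1e-12
    # M step: re-estimate p(w|z) and p(z|d) from expected counts
    expected = n[:, :, None] * post                 # n(d,w) * p(z|d,w)
    p_w_z = expected.sum(axis=1); p_w_z /= p_w_z.sum(axis=0)
    p_z_d = expected.sum(axis=0).T; p_z_d /= p_z_d.sum(axis=0)

# Reconstructed p(w|d) = sum_z p(w|z) p(z|d), comparable to the count matrix
print((p_w_z @ p_z_d).shape)  # (50, 20)
```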

Latent Subspace

pLSA vs. LSA

• LSA and pLSA both perform dimensionality reduction:
  – in LSA, by keeping only the K largest singular values;
  – in pLSA, by having K aspects.
• Comparison to SVD:
  – the U matrix relates to p(z | d) (document to aspect);
  – the V matrix relates to p(w | z) (aspect to term);
  – the Σ matrix relates to p(z) (aspect strength).

pLSA vs. LSA

• The main difference is the way the approximation is done.
• pLSA generates a model (the aspect model) and maximizes its predictive power.
• Selecting the proper value of K is heuristic in LSA.
• Model selection in statistics can determine the optimal K in pLSA.

Applications

• Text mining: topic discovery.
• Scene classification.

Text Mining

Scene Classification

Classification Result

Reference

• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• Bosch, A., Zisserman, A., and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A., and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportion.
• The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA

• There is no constraint on the distributions p(z | d_i).
• This easily leads to serious over-fitting.

(Unrolled diagram: each document d_1, …, d_m carries its own unconstrained topic proportions p(z | d_1), p(z | d_2), …, p(z | d_m).)

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.
• Requirements for such a distribution:
  – the samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one, i.e., the samples are multinomial parameter vectors;
  – it is easy to optimize.
• The Dirichlet distribution is one such distribution.
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution

• Definition:
  p(x_1, …, x_K | α) = [Γ(Σ_{i=1}^K α_i) / ∏_{i=1}^K Γ(α_i)] · ∏_{i=1}^K x_i^{α_i − 1},
  subject to x_i ≥ 0 and Σ_{i=1}^K x_i = 1.
• The density is zero outside this open (K − 1)-dimensional simplex.

Example Dirichlet Distributions (K = 3)

• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6). (Figure: the corresponding densities over the 2-simplex.)

Example Dirichlet Distributions (K = 3)

• Equal α_i, different concentration α_0 = Σ_{i=1}^K α_i: α_0 = 0.1, α_0 = 1, α_0 = 10.
• Small α_0 pushes the mass toward the corners of the simplex (sparse mixtures); large α_0 concentrates it around the uniform point.
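A quick way to see this effect is to draw samples (a sketch assuming NumPy's Dirichlet sampler):

```python
import numpy as np

rng = np.random.default_rng(3)
# Effect of the concentration alpha_0 = sum(alpha_i), with equal alpha_i, K = 3
for a0 in (0.1, 1.0, 10.0):
    samples = rng.dirichlet(alpha=[a0 / 3] * 3, size=5)
    print(a0, np.round(samples, 2))
# a0 = 0.1: samples sit near the simplex corners (one coordinate close to 1)
# a0 = 10 : samples cluster around the uniform point (1/3, 1/3, 1/3)
```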

The LDA Model

(Graphical diagram: for each of the M documents, θ → z_n → w_n over its N words, with corpus-level parameters α and β.)

• For each document:
  – choose θ ~ Dirichlet(α);
  – for each of the N words w_n:
    » choose a topic z_n ~ Multinomial(θ);
    » choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
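The generative process translates directly into code; here is a toy sketch (all dimensions and hyperparameters are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
K, V, M, N = 3, 30, 5, 40                  # topics, vocab size, docs, words/doc
alpha = np.full(K, 0.5)                    # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.ones(V), size=K)   # topic-word distributions p(w|z), rows sum to 1

docs = []
for _ in range(M):
    theta = rng.dirichlet(alpha)           # per-document topic proportions
    z = rng.choice(K, size=N, p=theta)     # a topic for each word position
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # a word from its topic
    docs.append(w)
print(docs[0])                             # word indices of the first document
```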

Joint Probability

• Given parameters α and β:
  p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^N p(z_n | θ) p(w_n | z_n, β),
• where p(z_n | θ) is simply θ_i for the topic i selected by z_n.

Likelihood

• Joint probability: p(θ, z, w | α, β) as above.
• Marginal distribution of a document:
  p(w | α, β) = ∫ p(θ | α) [∏_{n=1}^N Σ_{z_n} p(z_n | θ) p(w_n | z_n, β)] dθ.
• Likelihood over all the documents: ∏_{d=1}^M p(w_d | α, β).

Inference

• The likelihood can be computed by summing over each document, using Jensen's inequality as in EM.

Inference

• In the E step, we need to compute the posterior distribution of the hidden variables, p(θ, z | w, α, β).
• Unfortunately, this distribution is intractable to compute in general.
• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational distribution and the posterior.

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence.
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs. EM

• They differ only in the E step.
• In standard EM, q(X) is directly set to p(X | D, θ), making KL = 0.
• In VBEM, it is intractable to compute p(X | D, θ). Instead, we approximate p(X | D, θ) by a variational distribution q(X), obtained by minimizing KL(q(X) ‖ p(X | D, θ)).
• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.
• Strategy (variational EM):
  – lower-bound log p(w | α, β) by a function L(γ, φ; α, β);
  – repeat until convergence:
    – E: maximize L with respect to the variational parameters γ and φ;
    – M: maximize the bound with respect to the model parameters α and β.

Parameter Estimation

• E step: variational inference — repeat the γ and φ updates until convergence.
• M step: parameter estimation; β has a closed-form update from expected counts, and the update for α can be implemented using the Newton–Raphson method.
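For reference, a sketch of the per-document variational updates from the cited Blei et al. (2003) paper (φ_{nk} ∝ β_{k,w_n} exp(Ψ(γ_k)), γ = α + Σ_n φ_n), assuming NumPy/SciPy and toy dimensions:

```python
import numpy as np
from scipy.special import digamma

def variational_e_step(doc, alpha, beta, iters=50):
    """Per-document updates from Blei et al. (2003):
    phi[n, k] ∝ beta[k, w_n] * exp(digamma(gamma[k])),
    gamma[k]  = alpha[k] + sum_n phi[n, k]."""
    K = alpha.shape[0]
    N = doc.shape[0]
    phi = np.full((N, K), 1.0 / K)       # variational topic assignments
    gamma = alpha + N / K                # variational Dirichlet parameters
    for _ in range(iters):
        phi = beta[:, doc].T * np.exp(digamma(gamma))   # shape (N, K)
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)
    return phi, gamma

# Toy usage with made-up dimensions
rng = np.random.default_rng(5)
K, V = 3, 30
alpha = np.full(K, 0.1)
beta = rng.dirichlet(np.ones(V), size=K)   # rows: p(w | z = k)
doc = rng.integers(0, V, size=20)          # word indices of one document
phi, gamma = variational_e_step(doc, alpha, beta)
print(gamma)
```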

Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus.

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset: contains 8,000 documents and 15,818 words.
• (a) EARN vs. NOT EARN; (b) GRAIN vs. NOT GRAIN.

Problems in LDA

• The Dirichlet distribution helps avoid over-fitting, but the assumption might be too strong.

(Graphical diagram as before: θ → z_n → w_n per document, with shared parameters α and β.)

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporates category information.

(Plate diagram: per image, π → z → x over the N_d patches, M images, with parameters θ and β.)

Codebook

• 174 local image patches.
• Detection: evenly sampled grid, random sampling, saliency detector, or Lowe's DoG detector.
• Representation: normalized 11×11 gray values, or 128-dim SIFT.

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic Topic Models, ICML 2006.
• Correlated Topic Model, NIPS 2005.
• Hierarchical Dirichlet Processes, Journal of the American Statistical Association, 2003.
• Nonparametric Bayes Pachinko Allocation, UAI 2007.
• Supervised LDA, NIPS 2007.
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009.
• …

Are You Really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1,000 keywords:
  – we need to access 1,000 inverted lists;
  – the intersection of 1,000 inverted lists may be empty;
  – the union of 1,000 inverted lists may be the whole corpus.
• Dimension reduction, via topic projection:
  – original term vector (dim = 1 million): Img1 = (Term1: 1, Term2: 2, Term3: 0, Term4: 0, …, TermN: 2);
  – low-dimensional feature vector (dim = 200): Img1 = (f1: 0.2, f2: 0.1, …, fM: 0.03).

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space.
• X: projection matrix for dimension reduction.
• w: low-dimensional feature vector.
• ε: residual error.

  p ≈ Xw + ε

• An image = (figure: a dense low-dimensional histogram) + a few residual words (~10 words).

Orthogonal Decomposition

• Decompose p into a component in the subspace spanned by the orthonormal base vectors X = (x_1, …, x_k) and a residual orthogonal to that subspace:
  p = Xw + ε, with w = X^T p (the low-dimensional representation) and ε = p − X X^T p (the residual), so ε ⊥ x_i for all i.
• Because the residual is orthogonal to the subspace, inner products are preserved exactly:
  p^T q = w_p^T w_q + ε_p^T ε_q.
• An image = (figure: a dense low-dimensional histogram) + a few residual words (~10 words).
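A small NumPy sketch verifying this identity with a random orthonormal basis (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
V, k = 1000, 20                                # vocabulary size, reduced dimension
X, _ = np.linalg.qr(rng.normal(size=(V, k)))   # orthonormal basis, shape (V, k)

def decompose(p):
    w = X.T @ p          # low-dimensional representation
    eps = p - X @ w      # residual, orthogonal to the columns of X
    return w, eps

p, q = rng.normal(size=V), rng.normal(size=V)
wp, ep = decompose(p)
wq, eq = decompose(q)

# Inner product is preserved exactly: p.q = wp.wq + ep.eq
print(np.allclose(p @ q, wp @ wq + ep @ eq))   # True
```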

A Probabilistic Implementation

• x is a switch variable. It controls whether a word is generated from a topic-specific distribution, a document-specific distribution, or a background distribution:

  p(w | d) = p(x = 0 | d) · Σ_{k=1}^K p(w | z = k) p(z = k | d)
           + p(x = 1 | d) · p_doc(w | d)
           + p(x = 2 | d) · p_bg(w),

  where p_doc is the document-specific distribution and p_bg the background distribution.

• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.

Search (Online)

• A query is decomposed like an image: a dense low-dimensional vector + a few residual words.
• (Diagram: the query's low-dimensional vector probes an LSH index whose buckets hold document entries (DS1, DS2, …); matching buckets yield a candidate list, e.g., Doc 300, Doc 401, ….)
• The candidates (among Doc 1 … Doc N) are then re-ranked using the residual words and the document metadata.
• Index: 10M images, 4.6 GB. Search speed: < 100 ms.
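The two-stage flow can be sketched as follows; this is not the paper's implementation, just an illustration using random-hyperplane LSH and invented dimensions:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(7)
k, r, bits, n_docs = 20, 50, 16, 10000
H = rng.normal(size=(bits, k))                 # random hyperplanes for LSH

def signature(w):
    # sign pattern of the projections, packed into an integer bucket key
    return int("".join("1" if s > 0 else "0" for s in H @ w), 2)

docs_w = rng.normal(size=(n_docs, k))          # low-dimensional features
docs_eps = rng.normal(size=(n_docs, r))        # (compressed) residuals
index = defaultdict(list)
for i, w in enumerate(docs_w):                 # offline: build the LSH table
    index[signature(w)].append(i)

def search(q_w, q_eps, top=10):
    candidates = index[signature(q_w)]         # stage 1: LSH bucket lookup
    # stage 2: re-rank candidates by p.q ≈ w_p.w_q + eps_p.eps_q
    scores = sorted(((docs_w[i] @ q_w + docs_eps[i] @ q_eps, i)
                     for i in candidates), reverse=True)
    return scores[:top]

print(search(docs_w[0], docs_eps[0]))          # the document itself ranks first
```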

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: the topics being discussed.
• Structure: who replies to whom.
• Optimize them together: model the semantics and the structure jointly.

Reply Reconstruction

• Combines document similarity, topic similarity, and structure similarity.

Baselines

• NP: reply to the nearest post.
• RR: reply to the root.
• DS: document similarity.
• LDA: Latent Dirichlet Allocation; project documents to topic space.
• SWB: Special Words Topic Model with Background distribution; project documents to topic and junk-topic space.

Evaluation

Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding.
• Ranking methods on the constructed network: HITS, PageRank, ….

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert-finding task using a language model.
• PageRank: benchmark nodal ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node.

Evaluation

• Bayesian estimate.

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF (ori)     0.674   0.362   0.243
EABIF (rec)     0.742   0.318   0.281
PageRank (ori)  0.675   0.377   0.263
PageRank (rec)  0.743   0.321   0.266
HITS (ori)      0.906   0.832   0.900
HITS (rec)      0.938   0.822   0.906
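For reference, minimal implementations of the three reported metrics (a sketch; the toy relevance lists are invented):

```python
import numpy as np

def mrr(ranked_lists):
    # Mean Reciprocal Rank: 1/rank of the first relevant item, averaged over queries
    return np.mean([1.0 / (ranking.index(True) + 1) if True in ranking else 0.0
                    for ranking in ranked_lists])

def average_precision(ranking):
    # Precision at each relevant position, averaged (MAP = mean over queries)
    hits, precisions = 0, []
    for i, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return np.mean(precisions) if precisions else 0.0

def p_at_10(ranking):
    # Fraction of relevant items among the top 10
    return sum(ranking[:10]) / 10.0

# Toy usage: each ranking is a list of relevance flags in ranked order
runs = [[False, True, True, False], [True, False, False, True]]
print(mrr(runs),
      np.mean([average_precision(r) for r in runs]),
      np.mean([p_at_10(r) for r in runs]))
```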

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision:
  – matrix decomposition is a good practice for learning matrices;
  – graphical models are a good practice for learning probability.
• Graphical models are a good tool for analyzing problems.
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.
• Graphical models are more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 43: An Introduction To Matrix Decomposition and Graphical Model

BASIC CONCEPTS

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 44: An Introduction To Matrix Decomposition and Graphical Model

What Is Machine Learning

Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a

data set D is sometimes associated with desired outputs y1 y2

Predictionsbull We are generally interested in predicting something based on the observed

data setbull Given D what can we say about x(N+1)

Modelbull To make predictions we need to make some assumptions We can often

express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict

new data pointsbull The model can often be expressed as a probability distribution over data

points

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

• Joint probability

• Marginal distribution of a document

• Likelihood over all the documents
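
Following Blei et al. (2003), cited in the references below, the marginal distribution of one document and the likelihood over a corpus of M documents are:

  $$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$

  $$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$$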

Inference

• The likelihood can be computed by summing over each document
• Jensen's inequality in EM

Inference

• In the E-step, we need to compute the posterior distribution of the hidden variables

• Unfortunately, this distribution is intractable to compute in general

• We have to resort to a variational approach

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters and minimize the KL divergence between the variational and posterior distributions

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
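
In symbols, following Blei et al. (2003):

  $$\log p(\mathbf{w} \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)$$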

VBEM vs EM

• They differ only in the E-step

• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0
• In VBEM, it is intractable to compute p(X|D, θ); instead, p(X|D, θ) is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data

• Strategy (variational EM):

• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence:
  – E: Maximize L with respect to the variational parameters γ and φ
  – M: Maximize the bound with respect to the model parameters α and β

Parameter Estimation

• E-Step (variational inference): repeat until convergence

• M-Step (parameter estimation):
  – β: updated in closed form from the variational posteriors
  – α: can be implemented using the Newton-Raphson method
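
Following Blei et al. (2003), the updates take the form (Ψ is the digamma function):

  $$\phi_{ni} \propto \beta_{i w_n} \exp\big(\Psi(\gamma_i)\big), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}, \qquad \beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^{*}_{dni}\, w_{dn}^{j}$$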

Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset: contains 8,000 documents and 15,818 words

[Figure: classification results for (a) EARN vs. NOT EARN and (b) GRAIN vs. NOT GRAIN]

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong

[Figure: the same graphical model as above: three documents with topics z_1 … z_4, words w_1 … w_4, and shared parameter β]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information

[Figure: plate notation for the model: per-image mixing proportion π, topic z, and observed patch x, with parameters θ and β; N_d patches per image, M images]

Codebook

• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector
• Representation: normalized 11×11 gray values, 128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA: Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA

• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus

• Dimension reduction:

  Img1 in term space (Dim = 1 million):  Term1=1, Term2=2, Term3=0, Term4=0, …, TermN=2

    → Topic Projection →

  Img1 in topic space (Dim = 200):  f1=0.2, f2=0.1, …, fM=0.03

Key Idea Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• r: residual error

  $$p = Xw + r$$

[Figure: an image = a low-dimensional histogram over latent dimensions + a few words (10 words)]

Orthogonal Decomposition

  $$p = Xw + r, \qquad X = [x_1, x_2, \ldots, x_k]$$

  – x_1, …, x_k: base vectors (the columns of X)
  – w: the low-dimensional representation, w = X^T p when the base vectors are orthonormal
  – r: the residual, r = p − Xw

[Figure: an image = a low-dimensional histogram over latent dimensions + a few words (10 words)]

• With an orthogonal basis X = [x_1, x_2, x_3, …, x_k], inner products are approximately preserved:

  $$p^T q \approx w_p^T w_q + r_p^T r_q$$
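
A minimal numpy sketch of this idea, assuming an orthonormal basis X and keeping the 10 largest residual entries as the "few words" (the random basis is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 20                                 # vocabulary size, reduced dimension
X, _ = np.linalg.qr(rng.normal(size=(V, k)))    # orthonormal basis (toy stand-in)

def decompose(p, X, n_words=10):
    """Project p onto the basis X and keep the n_words largest residual entries."""
    w = X.T @ p                                 # low-dimensional representation
    r = p - X @ w                               # residual error in vocabulary space
    keep = np.argsort(np.abs(r))[-n_words:]     # the "few words"
    r_sparse = np.zeros_like(r); r_sparse[keep] = r[keep]
    return w, r_sparse

p, q = rng.random(V), rng.random(V)
wp, rp = decompose(p, X); wq, rq = decompose(q, X)
# similarity is approximately preserved: p.q ~ wp.wq + rp.rq
print(p @ q, wp @ wq + rp @ rq)
```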

A Probabilistic Implementation

x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution

• a document-specific distribution

• a background distribution

  $$p(w \mid d) = p(x{=}0 \mid d)\sum_{k=1}^{K} p(w \mid z{=}k)\,p(z{=}k \mid d) \,+\, p(x{=}1 \mid d)\,p_{\mathrm{doc}}(w \mid d) \,+\, p(x{=}2 \mid d)\,p_{\mathrm{bg}}(w)$$

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
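
A small sketch of evaluating this mixture, assuming the component distributions are given as arrays (shapes and names are mine):

```python
import numpy as np

def word_prob(w, d, p_x, p_w_z, p_z_d, p_w_doc, p_w_bg):
    """p(w|d) under the switch model.
    p_x: (3, M) switch probabilities per document; p_w_z: (V, K) topic-word;
    p_z_d: (K, M) doc-topic; p_w_doc: (V, M) doc-specific; p_w_bg: (V,) background."""
    topic_part = p_w_z[w] @ p_z_d[:, d]         # sum_k p(w|z=k) p(z=k|d)
    return (p_x[0, d] * topic_part
            + p_x[1, d] * p_w_doc[w, d]
            + p_x[2, d] * p_w_bg[w])
```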

Search (Online)

[Figure: online search pipeline. A query (a low-dimensional vector plus a few residual words) is hashed into an LSH index whose buckets hold document signatures (DS1, DS2, …); candidate documents (e.g., Doc 300, Doc 401) are retrieved from the matching bucket and re-ranked using the full representation and the document metadata]

Index: 10M images, 4.6 GB; search speed < 100 ms
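
A hedged sketch of this online pipeline with random-hyperplane LSH; the 16-bit signatures and the dot-product re-ranking are simplifications, not the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_bits = 200, 16
H = rng.normal(size=(n_bits, k))                 # random hyperplanes for LSH

def signature(w):
    """n_bits-bit LSH signature of a low-dimensional document vector w."""
    return int(''.join('1' if v > 0 else '0' for v in H @ w), 2)

# offline: bucket documents by signature
docs = rng.normal(size=(10000, k))
buckets = {}
for i, d in enumerate(docs):
    buckets.setdefault(signature(d), []).append(i)

# online: look up the query's bucket, then re-rank the candidates
query = rng.normal(size=k)
candidates = buckets.get(signature(query), [])
ranked = sorted(candidates, key=lambda i: -(docs[i] @ query))
```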

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-jing Wang, Wei Wang, Lei Zhang

SIGIR 2009


Semantic & structure

• Semantic: topics
• Structure: who replies to whom

Optimize them together

• Model semantics
• Model structure

Reply reconstruction

• Document similarity
• Topic similarity
• Structure similarity
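
A toy scorer in this spirit; the paper couples topics and structure through sparse coding rather than the fixed weighted sum below, and the weights are purely hypothetical:

```python
import numpy as np

def score_parent(post, candidates, doc_sim, topic_sim, struct_sim, w=(0.4, 0.4, 0.2)):
    """Rank candidate parent posts for `post` by a weighted mix of document,
    topic, and structure similarity (illustrative only)."""
    scores = [w[0] * doc_sim(post, c) + w[1] * topic_sim(post, c)
              + w[2] * struct_sim(post, c) for c in candidates]
    return candidates[int(np.argmax(scores))]
```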

Baselines:

• NP: reply to the nearest post
• RR: reply to the root
• DS: document similarity
• LDA: Latent Dirichlet Allocation; projects documents to the topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to the topic and junk-topic space

Evaluation

Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772

Expert finding

• Pipeline: reply reconstruction → network construction → expert finding

• Methods: HITS, PageRank, …
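
For reference, a compact HITS power iteration over the reconstructed reply network (the adjacency convention is my assumption):

```python
import numpy as np

def hits(A, iters=50):
    """HITS on adjacency matrix A, where A[i, j] = 1 if user i replies to user j."""
    n = A.shape[0]
    h, a = np.ones(n), np.ones(n)
    for _ in range(iters):
        a = A.T @ h; a /= np.linalg.norm(a)   # authorities: pointed to by good hubs
        h = A @ a;  h /= np.linalg.norm(h)    # hubs: point to good authorities
    return h, a
```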

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance in the expert finding task using a language model
• PageRank: benchmark node ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential nodes

Evaluation

• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: a good practice for learning matrices
  – Graphical models: a good practice for learning probability

• Graphical models are a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 45: An Introduction To Matrix Decomposition and Graphical Model

Likelihood Function

bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data

bull Inversely given the observed data and a model of interest Likelihood function is defined as

L(θ) = fθ(x|θ) = p(x|θ)

bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 46: An Introduction To Matrix Decomposition and Graphical Model

Maximum Likelihood (ML)

bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model

bull Suppose we are given n data samples (x1 x2 hellip xn)

bull Maximum likelihood will find θ that maximize L(θ)

bull Predictive distribution

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 47: An Introduction To Matrix Decomposition and Graphical Model

IID ndash Independent Identically Distributed

bull IID means

bull The problem is considerably simplified as

bull Usually log likehood is used

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)

bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008

EXPECTATION MAXIMIZATION

Why We Need EM

bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models

bull Why we need latent variables

bull To describe complex model Gaussian Mixture Model

bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

• The lower bound on the log likelihood is given by F(q, θ).

• EM alternates between:

• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed.

• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed.

The E Step

• E step, for fixed θ:

• The second term is the Kullback–Leibler divergence.
• This means that, for fixed θ, F is bounded above by L and achieves that bound when q(X) equals the posterior.
• So the E step simply sets q(X) to the posterior p(X | D, θ).
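Written out (the standard EM identities, consistent with the VBEM slide later in the deck):

F(q, θ) = L(θ) − KL( q(X) ‖ p(X | D, θ) ),   so the E step sets   q^(t)(X) = p(X | D, θ^(t)).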

The M Step

• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed.

• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ.

• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically.
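The update the bullets describe (standard form of the lost equation):

θ^(t+1) = argmax_θ ∫ q^(t)(X) log p(X, D | θ) dX,

where the entropy term H[q] has been dropped from F(q, θ) because it does not depend on θ (the "second equality" above).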

EM Never Decreases the Likelihood

• The E and M steps together never decrease the log likelihood:

• The E step brings F(q, θ) up to the likelihood L(θ).
• The M step then maximizes F(q, θ) w.r.t. θ, lifting it further.
• F(q, θ) ≤ L(θ) by Jensen – or, equivalently, from the non-negativity of KL.

Reference

• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Unsupervised Learning, Lecture 5 slides)

• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

• Cons:
– A graphical model can become complex even with a few circles (nodes)…
– We have to make too many assumptions.

• Pros:
– We do need probability to explain our world, but the joint probability is hard to compute.
– Graphical models can help us analyze and understand our problems.
– Graphs are an intuitive way of representing and visualizing the relationships between many variables.
– With a graphical model, we can decouple a joint probability into conditional probabilities, which are usually easier.

Directed Acyclic Graphical Models (Bayesian Networks)

• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution:

p(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|B,C)p(E|C,D)

• In general:

• where pa(i) are the parents of node i.
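The general factorization (reconstructed in standard notation, since the slide's formula is image-only):

p(x_1, …, x_K) = Π_{i=1}^K p( x_i | x_{pa(i)} ).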

Directed Graphs for Statistical Models: Plate Notation

• A data set of N points generated from a Gaussian.
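As an example of what the plate expresses (a standard illustration, not verbatim from the slide): the N i.i.d. points inside the plate share the Gaussian's parameters, so

p(x_1, …, x_N | μ, σ²) = Π_{n=1}^N N(x_n | μ, σ²).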

PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

• For natural language queries, simple term matching does not work effectively:
– Ambiguous terms.
– The same queries vary due to personal styles.

• Latent semantic indexing:
– Creates a 'latent semantic space' (hidden meaning).

• LSI puts documents together even if they don't have common words, as long as the docs share frequently co-occurring terms.

• Disadvantages:
– The statistical foundation is missing.

pLSA – Probabilistic Latent Semantic Analysis

• Automated document indexing and information retrieval.
• Identification of latent classes using an Expectation Maximization (EM) algorithm.
• Shown to solve:
– Polysemy:
• "Java" could mean "coffee" and also the "PL Java".
• "Cricket" is a "game" and also an "insect".
– Synonymy:
• "computer", "pc", "desktop" all could mean the same.

• Has a better statistical foundation than LSA.

pLSA

[Figure: pLSA in plate notation – document d, latent topic z, word w; N_d words per document, M documents – and the same model unrolled, with one topic variable z_i per word w_i.]

z_1, …, z_N are variables, z_i ∈ [1, K]; K is the number of latent topics.

pLSA

[Figure: pLSA unrolled over documents d_1, …, d_M; each document has its own word–topic chains (z_i, w_i).]

p(w|z=1), …, p(w|z=K) are shared for all documents.


Joint Probability vs Likelihood

• Joint probability:

• Likelihood (only for observed variables):

• p(d) is assumed to be uniform.
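The lost formulas have the standard pLSA form:

p(d, w) = p(d) p(w | d),   with   p(w | d) = Σ_{z=1}^K p(w | z) p(z | d),

L = Π_d Π_w p(d, w)^{n(d, w)},   where n(d, w) is the count of word w in document d.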

Document Decomposition

• Each document can be decomposed as:

• This is similar to matrix decomposition, if we consider each discrete distribution as a vector:

p(w|d) = Z_{V×K} · p(z|d)

• With many documents, we hope to find the latent topics as a common basis.

pLSA – Objective Function

• pLSA tries to maximize the log likelihood:

• Due to the summation over z inside the log, we have to resort to EM.
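The objective, written out (standard form of the lost equation):

ℓ = Σ_d Σ_w n(d, w) log Σ_z p(w | z) p(z | d) + const,

where the constant collects the uniform p(d) terms; note the sum over z sits inside the log.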

EM Steps

• E-Step:
– The expectation of the likelihood function is calculated with the current parameter values.

• M-Step:
– Update the parameters with the calculated posterior probabilities.
– Find the parameters that maximize the likelihood function.

Lower Bounding the Log Likelihood

EM Steps

• The E-Step:

• The M-Step:
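The update equations themselves are image-only on the slides; the standard pLSA updates (Hofmann 1999) are:

E-step:   p(z | d, w) = p(w | z) p(z | d) / Σ_{z'} p(w | z') p(z' | d)

M-step:   p(w | z) ∝ Σ_d n(d, w) p(z | d, w),    p(z | d) ∝ Σ_w n(d, w) p(z | d, w).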

Latent Subspace

pLSA vs LSA

• LSA and pLSA perform dimensionality reduction:
– In LSA, by keeping only K singular values.
– In pLSA, by having K aspects.

• Comparison to SVD:
– U matrix: related to P(z|d) (doc to aspect).
– V matrix: related to P(w|z) (aspect to term).
– Σ (diagonal) matrix: related to P(z) (aspect strength).

pLSA vs LSA

• The main difference is the way the approximation is done:

• pLSA generates a model (the aspect model) and maximizes its predictive power.

• Selecting the proper value of K is heuristic in LSA.

• Model selection in statistics can determine the optimal K in pLSA.

Applications

• Text mining: topic discovery.

• Scene classification.

Text Mining

Scene Classification

Classification Result

Reference

• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.

• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).

• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level: each doc has its own topic mixture proportion.

• The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA

• There is no constraint on the distributions p(z|d_i).

• This easily leads to serious over-fitting problems.

[Figure: pLSA unrolled over documents d_1, …, d_m, each with its own unconstrained mixture p(z|d_1), p(z|d_2), …, p(z|d_m).]

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.

• Requirements for such a distribution:
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomial parameters.
– Easy to optimize.

• The Dirichlet distribution is one such distribution.

• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution

• Definition:

p(x_1, …, x_K | α) = [ Γ(Σ_{i=1}^K α_i) / Π_{i=1}^K Γ(α_i) ] · Π_{i=1}^K x_i^{α_i − 1},
s.t. x_i ≥ 0 and Σ_{i=1}^K x_i = 1.

• The density is zero outside this open (K − 1)-dimensional simplex.

Example Dirichlet Distributions (K=3)

• Various parameter α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6). [Density plots omitted.]

Example Dirichlet Distributions (K=3) (2)

• Equal α_i, different α_0 = Σ_{i=1}^K α_i: α_0 = 0.1, α_0 = 1, α_0 = 10. [Density plots omitted.]
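A small numpy illustration of these two slides (the sample counts and seed are arbitrary assumptions): samples from a K=3 Dirichlet are non-negative triples summing to one, and α_0 controls how concentrated they are.

```python
import numpy as np

rng = np.random.default_rng(0)
for a0 in (0.1, 1.0, 10.0):           # alpha_0 = sum of the alpha_i
    alpha = np.full(3, a0 / 3)        # equal alpha_i, as on the slide
    s = rng.dirichlet(alpha, size=4)  # each row lies on the 2-simplex
    print(a0, s.round(3), s.sum(axis=1))  # rows are >= 0 and sum to 1
```

Small α_0 pushes samples toward the corners of the simplex (sparse mixtures); large α_0 concentrates them near the center.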

The LDA Model

[Figure: LDA graphical model – per-document topic proportions θ feed topic variables z_1…z_4, which generate words w_1…w_4; β holds the topic–word distributions.]

• For each document:
• Choose θ ~ Dirichlet(α).
• For each of the N words w_n:
– Choose a topic z_n ~ Multinomial(θ).
– Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.

The LDA Model (2)

• The same generative process, restated: θ is drawn once per document from Dirichlet(α), and for each of the N word positions a topic z_n ~ Multinomial(θ) and a word w_n ~ p(w_n | z_n, β) are drawn.
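A runnable sketch of this generative process (the vocabulary size, topic count, and hyperparameters are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 4, 20, 15                       # topics, vocabulary size, words per doc
alpha = np.full(K, 0.5)                   # Dirichlet prior on topic proportions
beta = rng.dirichlet(np.full(V, 0.1), K)  # K topic-word multinomials (rows)

def generate_doc():
    theta = rng.dirichlet(alpha)          # per-document topic mixture
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)        # choose topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])      # choose word w_n ~ p(w | z_n, beta)
        words.append(w)
    return words

print(generate_doc())                     # a toy document as word ids
```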

Joint Probability

• Given parameters α and β:

p(θ, z, w | α, β) = p(θ | α) Π_{n=1}^N p(z_n | θ) p(w_n | z_n, β)

• where p(z_n | θ) is simply θ_i for the unique i such that z_n^i = 1.

Likelihood

• Joint probability (as above).

• Marginal distribution of a document:

• Likelihood over all the documents:
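Reconstructed in the standard notation of Blei et al. (the slide's equations are image-only):

p(w | α, β) = ∫ p(θ | α) [ Π_{n=1}^N Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ] dθ,

L(α, β) = Π_{d=1}^M p(w_d | α, β).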

Inference

• The likelihood can be computed by summing over each document.
• Jensen's inequality is used, as in EM:

Inference

• In the E-Step, we need to compute the posterior distribution of the hidden variables:

• Unfortunately, this distribution is intractable to compute in general.

• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions.

Variational Inference (2)

• The difference between the lower bound and the likelihood is the KL divergence.

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs EM

• They differ only in the E-Step.

• In standard EM, q(X) is directly set to p(X|D,θ), making KL = 0.
• In VBEM, it is intractable to compute p(X|D,θ). Instead, p(X|D,θ) is approximated by a variational distribution q(X), found by minimizing KL( q(X) ‖ p(X|D,θ) ).

• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.

• Strategy (variational EM):

• Lower bound log p(w | α, β) by a function L(γ, φ; α, β).
• Repeat until convergence:
– E: maximize L with respect to the variational parameters γ and φ.
– M: maximize the bound with respect to the model parameters α and β.

Parameter Estimation (2)

• E-Step: variational inference – repeat until convergence.

• M-Step: parameter estimation:
– β has a closed-form update from the variational posteriors;
– α can be implemented using the Newton–Raphson method.

Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus. [Table of top words per topic omitted.]

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset – contains 8,000 documents and 15,818 words.

(a) EARN vs. NOT EARN. (b) GRAIN vs. NOT GRAIN. [Accuracy plots omitted.]

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

[Figure: the LDA graphical model repeated for three documents, all sharing a single Dirichlet prior.]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information.

[Figure: plate model – per-image mixture π, topic z, patch feature x, with parameters θ and β; N_d patches per image, M images.]

Codebook

• 174 local image patches.

• Detection: evenly sampled grid, random sampling, saliency detector, or Lowe's DoG detector.

• Representation: normalized 11×11 gray values, or 128-dim SIFT.

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006.
• Correlated Topic Model, NIPS 2005.
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003.
• Nonparametric Bayes pachinko allocation, UAI 2007.
• Supervised LDA, NIPS 2007.
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009.
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.

• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA

• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
– Need to access 1000 inverted lists.
– The intersection of 1000 inverted lists may be empty.
– The union of 1000 inverted lists may be the whole corpus.

• Dimension reduction:

       Term1  Term2  Term3  Term4  …  TermN
Img1:  1      2      0      0      …  2        (Dim = 1 million)

→ Topic Projection →

       f1   f2   …  fM
Img1:  0.2  0.1  …  0.03                        (Dim = 200)

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space.
• X: projection matrix for dimension reduction.
• w: low-dimensional feature vector.
• ξ: residual error.

p ≈ Xw + ξ

An image = [a low-dimensional topic histogram] + a few words (~10 words).

Orthogonal Decomposition

p = Xw + ξ, with X = [x_1, x_2, …, x_k] an orthonormal basis (base vectors), w the low-dimensional representation, and ξ = p − Xw the residual.

An image = [a low-dimensional topic histogram] + a few words (~10 words).

Since the residual is orthogonal to the span of X, inner products are preserved:

p^T q = w_p^T w_q + ξ_p^T ξ_q.
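A numpy sketch of this property, assuming X has orthonormal columns (here obtained from a QR factorization of a random matrix; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 20                                # vocabulary size, reduced dimension
X, _ = np.linalg.qr(rng.normal(size=(n, k)))   # orthonormal basis (illustrative)

def decompose(p):
    w = X.T @ p            # low-dimensional representation
    resid = p - X @ w      # residual, orthogonal to span(X)
    return w, resid

p, q = rng.normal(size=n), rng.normal(size=n)
wp, rp = decompose(p)
wq, rq = decompose(q)
# Inner product is preserved exactly: p.q = wp.wq + rp.rq
print(np.allclose(p @ q, wp @ wq + rp @ rq))   # True
```

Because X^T X = I and X^T ξ = 0, the identity holds exactly; this is what lets an index over the short vectors w, plus a few residual words, approximate the original similarity.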

A Probabilistic Implementation

x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution,

• a document-specific distribution,

• a background distribution:

p(w | d) = p(x=0 | d) Σ_{k=1}^K p(w | z=k) p(z=k | d) + p(x=1 | d) p(w | d) + p(x=2 | d) p(w | background)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.

Search (Online)

[Figure: online search pipeline – a query = a low-dimensional doc signature (DS) + a few words; an LSH index over doc signatures (DS1, DS2, …) retrieves candidates (e.g., Doc 300, Doc 401, …), which are then re-ranked against doc metadata to produce the final list.]

Index: 10M images, 4.6 GB. Search speed: < 100 ms.

Search Example

[Figure: query image and retrieved results.]

Search Example (2)

[Figure: another query image and retrieved results.]

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics.

• Structure: who replies to whom.

• Optimize them together: model the semantics and the structure jointly.

Reply Reconstruction

[Figure: reply links recovered by combining document similarity, topic similarity, and structure similarity.]

Baselines:

• NP: reply to the nearest post.
• RR: reply to the root.
• DS: document similarity.
• LDA: Latent Dirichlet Allocation; project documents into topic space.
• SWB: Special Words Topic Model with Background distribution; project documents into topic and junk-topic space.

Evaluation

Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding.

• Methods: HITS, PageRank, …

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert-finding task using a language model.
• PageRank: benchmark nodal ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential nodes.

Evaluation

• Bayesian estimate.

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision:
– Matrix decomposition – a good practice for learning matrices.
– Graphical model – a good practice for learning probability.

• A graphical model is a good tool to analyze problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 50: An Introduction To Matrix Decomposition and Graphical Model

Why We Need EM

• The Expectation–Maximization (EM) algorithm is a method for maximum-likelihood (ML) learning of the parameters of latent variable models.

• Why do we need latent variables?

• To describe complex models, e.g., the Gaussian Mixture Model.

• To discover the intrinsic structure inside a data set, e.g., topic models such as pLSA and LDA.

More General

• Data set: D = {y_1, …, y_N}  • Likelihood: p(D|θ)

• Goal: learn maximum-likelihood (ML) parameter values.

• The maximum-likelihood procedure finds parameters θ such that θ_ML = argmax_θ p(D|θ).

• Because of the integral (or sum) over latent variables, p(D|θ) = Σ_X p(X, D|θ), the likelihood can be a very complicated and hard-to-optimize function.

The Expectation Maximization (EM) Algorithm

• The EM algorithm finds a (local) maximum of a latent variable model's likelihood. It starts from arbitrary values of the parameters and iterates two steps:

• E step: fill in the values of the latent variables according to their posterior given the data.

• M step: maximize the likelihood as if the latent variables were not hidden.

• It decomposes difficult problems into a series of tractable steps.
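As a concrete illustration of the two steps, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture (an assumed toy setting, not code from the talk): the E step computes posterior responsibilities, and the M step re-estimates the parameters from them.

    import numpy as np

    def em_gmm_1d(y, n_iter=50):
        # Start from arbitrary parameter values, as the slide says
        mu = np.array([y.min(), y.max()])
        var = np.array([y.var(), y.var()])
        pi = np.array([0.5, 0.5])
        for _ in range(n_iter):
            # E step: responsibilities r[n, k] = p(x_n = k | y_n, theta)
            dens = pi * np.exp(-(y[:, None] - mu) ** 2 / (2 * var)) \
                   / np.sqrt(2 * np.pi * var)
            r = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
            # M step: maximize the expected complete-data log likelihood
            Nk = r.sum(axis=0)
            mu = (r * y[:, None]).sum(axis=0) / Nk
            var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / Nk
            pi = Nk / len(y)
        return mu, var, pi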

Jensen's Inequality

• For a concave function f (such as log): f(E[x]) ≥ E[f(x)].

Lower Bounding the Log Likelihood
• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ.
• Goal: maximize the log likelihood L(θ) = log p(D|θ) (i.e., ML learning) w.r.t. θ.

• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood, using Jensen's inequality:

L(θ) = log Σ_X p(X, D|θ) = log Σ_X q(X) · p(X, D|θ)/q(X) ≥ Σ_X q(X) log [p(X, D|θ)/q(X)] = ⟨log p(X, D|θ)⟩_q + H[q] = F(q, θ)

• where H[q] is the entropy of q(X).

The E and M Steps of EM

• The lower bound on the log likelihood is given by F(q, θ) = ⟨log p(X, D|θ)⟩_q + H[q].

• EM alternates between:

• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed: q^(t+1) = argmax_q F(q, θ^(t)).

• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed: θ^(t+1) = argmax_θ F(q^(t+1), θ).

The E Step

• E step: for fixed θ,

F(q, θ) = L(θ) − KL(q(X) ‖ p(X|D, θ))

• The second term is the Kullback–Leibler divergence.
• This means that, for fixed θ, F is bounded above by L and achieves that bound when q(X) = p(X|D, θ).
• So the E step simply sets q^(t+1)(X) = p(X|D, θ^(t)).

The M Step

• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed:

θ^(t+1) = argmax_θ F(q, θ) = argmax_θ ⟨log p(X, D|θ)⟩_q

• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ.

• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically.

EM Never Decreases the Likelihood

• The E and M steps together never decrease the log likelihood:

L(θ^(t)) = F(q^(t+1), θ^(t)) ≤ F(q^(t+1), θ^(t+1)) ≤ L(θ^(t+1))

• The E step brings F(q, θ) up to touch the likelihood L(θ).
• The M step lifts F(q, θ) by maximizing it w.r.t. θ.
• F(q, θ) ≤ L(θ) by Jensen – or, equivalently, from the non-negativity of KL.

Reference

• Zoubin Ghahramani. Machine Learning (4F13), University of Cambridge, 2006. (Unsupervised Learning, Lecture 5 slides)

• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.

WHY DO WE NEED GRAPHICAL MODELS

Why Do We Need Graphical Models

• Cons:
– A graphical model becomes complex even with a few cycles…
– We have to make too many assumptions.

• Pros:
– We do need probability to explain our world, but the joint probability is hard to compute.
– A graphical model can help us analyze and understand our problems.
– Graphs are an intuitive way of representing and visualizing the relationships between many variables.
– With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier.

Directed Acyclic Graphical Models (Bayesian Networks)

• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution:

p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)

• In general:

p(x_1, …, x_N) = Π_{i=1}^N p(x_i | x_pa(i))

• where pa(i) are the parents of node i.
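To make the factorization concrete, here is a small plain-Python sketch for the five-variable DAG above, with hypothetical conditional probability tables (the numbers are illustrative assumptions, not from the talk); the factorized joint sums to one over all assignments.

    import itertools

    # Hypothetical CPTs for binary variables, stored as p(X = 1 | parents)
    pA1, pB1 = 0.4, 0.3
    pC1 = {(a, b): 0.9 if a == b else 0.2 for a in (0, 1) for b in (0, 1)}
    pD1 = {(b, c): [0.5, 0.3, 0.8, 0.6][2 * b + c] for b in (0, 1) for c in (0, 1)}
    pE1 = {(c, d): 0.1 + 0.8 * c * d for c in (0, 1) for d in (0, 1)}

    def bern(p1, v):  # probability of binary value v when p(v = 1) = p1
        return p1 if v else 1.0 - p1

    def joint(a, b, c, d, e):  # p(A)p(B)p(C|A,B)p(D|B,C)p(E|C,D)
        return (bern(pA1, a) * bern(pB1, b) * bern(pC1[(a, b)], c)
                * bern(pD1[(b, c)], d) * bern(pE1[(c, d)], e))

    total = sum(joint(*v) for v in itertools.product((0, 1), repeat=5))
    assert abs(total - 1.0) < 1e-9  # a valid factorized joint sums to 1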

Directed Graphs for Statistical Models: Plate Notation

• A data set of N points generated from a Gaussian.

PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

• For natural-language queries, simple term matching does not work effectively:
– Terms are ambiguous.
– The same queries vary due to personal styles.

• Latent semantic indexing:
– Creates a 'latent semantic space' (hidden meaning).

• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms.

• Disadvantages:
– The statistical foundation is missing.

pLSA – Probabilistic Latent Semantic Analysis

• Automated document indexing and information retrieval.
• Identification of latent classes using an Expectation-Maximization (EM) algorithm.
• Shown to solve:
– Polysemy
• "Java" could mean "coffee" and also the programming language Java.
• "Cricket" is a game and also an insect.
– Synonymy
• "computer", "pc", and "desktop" could all mean the same thing.

• Has a better statistical foundation than LSA.

pLSA

[Graphical model: plate notation – document d generates topic z, which generates word w; the inner plate repeats N_d times (the words of document d), the outer plate M times (the documents). Equivalently, unrolled: d → z_1 … z_N → w_1 … w_N.]

• z_1, …, z_N are latent variables, z_i ∈ [1, K], where K is the number of latent topics.

pLSA

[Figure: the unrolled model for M documents d_1, …, d_M, each with its own words w and latent topic variables z.]

• The topic-word distributions p(w|z=1), …, p(w|z=K) are shared by all documents.

Likelihood

L = Π_d Π_w p(d, w)^n(d,w), where n(d, w) is the count of word w in document d.

Joint Probability vs Likelihood

• Joint probability: p(d, w) = p(d) Σ_z p(w|z) p(z|d)

• Likelihood (only for observed variables): log L = Σ_d Σ_w n(d, w) log p(d, w)

• p(d) is assumed to be uniform.

Document Decomposition

• Each document can be decomposed as:

p(w|d) = Z_{V×K} · p(z|d), where the columns of Z_{V×K} are the topic distributions p(w|z).

• This is similar to matrix decomposition, if we consider each discrete distribution as a vector.

• With many documents, we hope to find the latent topics as a common basis.

pLSA – Objective Function

• pLSA tries to maximize the log likelihood:

max Σ_d Σ_w n(d, w) log Σ_z p(w|z) p(z|d)

• Due to the summation over z inside the log, we have to resort to EM.

EM Steps

• E-Step:
– The expectation of the likelihood function is calculated with the current parameter values.

• M-Step:
– Update the parameters with the calculated posterior probabilities.
– Find the parameters that maximize the likelihood function.

Lower Bounding the Log Likelihood

Σ_d Σ_w n(d, w) log Σ_z p(w|z) p(z|d) ≥ Σ_d Σ_w n(d, w) Σ_z q(z|d, w) log [p(w|z) p(z|d) / q(z|d, w)]

EM Steps

• The E-Step:

p(z|d, w) = p(w|z) p(z|d) / Σ_z' p(w|z') p(z'|d)

• The M-Step:

p(w|z) ∝ Σ_d n(d, w) p(z|d, w),   p(z|d) ∝ Σ_w n(d, w) p(z|d, w)
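A compact NumPy sketch of these updates (a toy implementation under the uniform-p(d) assumption above; the array names are mine, not from the talk):

    import numpy as np

    def plsa(n_dw, K, n_iter=100, seed=0):
        # n_dw: (D, W) word-count matrix; returns p(w|z) and p(z|d)
        rng = np.random.default_rng(seed)
        D, W = n_dw.shape
        p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)
        for _ in range(n_iter):
            # E-step: posterior p(z|d,w) ∝ p(w|z) p(z|d), shape (D, W, K)
            post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
            post /= post.sum(axis=2, keepdims=True) + 1e-12
            # M-step: re-estimate from expected counts n(d,w) p(z|d,w)
            nz = n_dw[:, :, None] * post
            p_w_z = nz.sum(axis=0).T
            p_w_z /= p_w_z.sum(axis=1, keepdims=True)
            p_z_d = nz.sum(axis=1)
            p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        return p_w_z, p_z_d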

Latent Subspace

pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction:
– In LSA, by keeping only the K largest singular values.
– In pLSA, by having K aspects.

• Comparison to SVD:
– The U matrix is related to P(z|d) (document to aspect).
– The V matrix is related to P(w|z) (aspect to term).
– The Σ matrix is related to P(z) (aspect strength).

pLSA vs LSA
• The main difference is the way the approximation is done.

• pLSA generates a model (the aspect model) and maximizes its predictive power.

• Selecting the proper value of K is heuristic in LSA.

• Model selection in statistics can determine the optimal K in pLSA.

Applications

• Text mining: topic discovery

• Scene classification

Text Mining

Scene Classification

Classification Result

Reference

• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.

• Bosch, A., Zisserman, A., and Munoz, X. Scene Classification via pLSA. In Proceedings of the European Conference on Computer Vision (2006).

• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A., and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level: each doc has its own topic mixture proportion.

• The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA

• There is no constraint on the distributions p(z|d_i).

• This easily leads to serious over-fitting problems.

[Figure: M unrolled document models, each document d_i with its own unconstrained topic mixture p(z|d_i).]

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.

• Requirements for such a distribution:
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomial parameter vectors.
– Easy to optimize.

• The Dirichlet distribution is one such distribution.

• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution

• Definition:

p(x_1, x_2, …, x_K | α) = [Γ(Σ_{i=1}^K α_i) / Π_{i=1}^K Γ(α_i)] · Π_{i=1}^K x_i^(α_i − 1),   where x_i ≥ 0 and Σ_{i=1}^K x_i = 1

• The density is zero outside this open (K − 1)-dimensional simplex.

Example Dirichlet Distributions (K=3)

• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)

Example Dirichlet Distributions (K=3)

• Equal α_i, different α_0 = Σ_{i=1}^K α_i: α_0 = 0.1, α_0 = 1, α_0 = 10
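A quick NumPy check of these properties (my own illustrative snippet, not from the talk): samples lie on the simplex, and α_0 controls how concentrated they are.

    import numpy as np

    rng = np.random.default_rng(0)
    for alpha in [(6, 2, 2), (0.1, 0.1, 0.1), (10, 10, 10)]:
        x = rng.dirichlet(alpha, size=5000)
        assert np.allclose(x.sum(axis=1), 1.0)  # points on the 2-simplex
        print(alpha, "sample mean:", x.mean(axis=0).round(3))
    # Small alpha_0 pushes samples toward the simplex corners (sparse
    # mixtures); large alpha_0 concentrates them near the mean alpha/alpha_0.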

The LDA Model

[Figure: three documents, each with words w_1…w_4 and topics z_1…z_4; a per-document θ is drawn from a Dirichlet(α), and the topic-word parameters β are shared across documents.]

• For each document:
• Choose θ ~ Dirichlet(α)
• For each of the N words w_n:
– Choose a topic z_n ~ Multinomial(θ)
– Choose a word w_n from p(w_n|z_n, β), a multinomial probability conditioned on the topic z_n
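The generative process above, written as a toy NumPy sketch (the symmetric priors and all sizes are illustrative assumptions):

    import numpy as np

    def generate_lda_corpus(M=100, N=50, K=5, V=1000, alpha=0.1, eta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        beta = rng.dirichlet([eta] * V, size=K)    # K topic-word distributions
        docs = []
        for _ in range(M):
            theta = rng.dirichlet([alpha] * K)     # per-document topic mixture
            z = rng.choice(K, size=N, p=theta)     # topic assignment per word
            w = [rng.choice(V, p=beta[k]) for k in z]  # word drawn per topic
            docs.append(w)
        return docs, beta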

Joint Probability

• Given parameters α and β:

p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^N p(z_n|θ) p(w_n|z_n, β)

• where p(z_n|θ) is simply θ_i for the unique i such that z_n^i = 1.

Likelihood

• Joint probability: p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^N p(z_n|θ) p(w_n|z_n, β)

• Marginal distribution of a document:

p(w | α, β) = ∫ p(θ|α) [Π_{n=1}^N Σ_{z_n} p(z_n|θ) p(w_n|z_n, β)] dθ

• Likelihood over all the documents: L = Π_{d=1}^M p(w_d | α, β)

Inference

• The likelihood can be computed by summing over each document.
• Jensen's inequality is used in EM.

Inference

• In the E-step, we need to compute the posterior distribution of the hidden variables:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

• Unfortunately, this distribution is intractable to compute in general.

• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions.

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence:

log p(w | α, β) = L(γ, φ; α, β) + KL(q(θ, z | γ, φ) ‖ p(θ, z | w, α, β))

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs EM

• They differ only in the E-step.

• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, we approximate p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) ‖ p(X|D, θ)).

• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.

• Strategy (variational EM):

• Lower-bound log p(w | α, β) by a function L(γ, φ; α, β).
• Repeat until convergence:
– E: maximize L with respect to the variational parameters γ and φ.
– M: maximize the bound with respect to the model parameters α and β.

Parameter Estimation

• E-Step (variational inference – repeat until convergence):

φ_ni ∝ β_{i,w_n} · exp(Ψ(γ_i)),   γ_i = α_i + Σ_{n=1}^N φ_ni

• M-Step (parameter estimation):

β_ij ∝ Σ_{d=1}^M Σ_{n=1}^{N_d} φ_{dni} w_{dn}^j

α can be estimated using the Newton–Raphson method.
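A sketch of the per-document variational E-step in NumPy, following the update equations above (notation after Blei et al.; the function and variable names are my own):

    import numpy as np
    from scipy.special import digamma

    def lda_e_step(doc_words, alpha, beta, n_iter=50):
        # doc_words: array of word ids; beta: (K, V) topic-word probabilities
        K = beta.shape[0]
        N = len(doc_words)
        gamma = np.full(K, alpha + N / K)          # q(theta) Dirichlet params
        for _ in range(n_iter):
            phi = beta[:, doc_words].T * np.exp(digamma(gamma))
            phi /= phi.sum(axis=1, keepdims=True)  # q(z_n) multinomials
            gamma = alpha + phi.sum(axis=0)
        return gamma, phi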

Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus.

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset – contains 8,000 documents and 15,818 words.

(a) EARN vs. NOT EARN    (b) GRAIN vs. NOT GRAIN

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

[Figure: the same three unrolled documents as before, with every document's topic mixture tied to a single Dirichlet prior.]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information.

[Plate diagram with variables π, z, x, θ, and β, over N_d patches in each of M images.]

Codebook
• 174 local image patches.

• Detection: evenly sampled grid, random sampling, saliency detector, or Lowe's DoG detector.

• Representation: normalized 11×11 gray values, or 128-dim SIFT.

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic Topic Models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes Pachinko Allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are You Really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.

• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA

• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1,000 keywords:
– We need to access 1,000 inverted lists.
– The intersection of the 1,000 inverted lists may be empty.
– The union of the 1,000 inverted lists may be the whole corpus.

• Dimension reduction (topic projection):

          Term1  Term2  Term3  Term4  …  TermN      (dim = 1 million)
    Img1  1      2      0      0      …  2

          ↓ topic projection

          f1     f2     …      fM                   (dim = 200)
    Img1  0.2    0.1    …      0.03

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in the vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

p = Xw + ε

[Figure: an image is represented as a low-dimensional topic vector (bar chart) plus a few residual words (~10 words).]

Orthogonal Decomposition

p = Xw + ε = [x_1, x_2, …, x_k] w + ε = Σ_{i=1}^k w_i x_i + ε

• x_1, …, x_k: base vectors (the columns of X)
• w: low-dimensional representation
• ε: residual

[Figure: a second image represented the same way – a low-dimensional topic vector plus a few residual words (~10 words).]

• With orthonormal base vectors X = [x_1, x_2, x_3, …, x_k] (XᵀX = I) and residuals orthogonal to span(X), the original similarity decomposes as:

pᵀq = (Xw_p + ε_p)ᵀ(Xw_q + ε_q) = w_pᵀw_q + ε_pᵀε_q
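A small NumPy sketch of this decomposition, using a random orthonormal basis as an assumed stand-in for the learned projection X:

    import numpy as np

    rng = np.random.default_rng(0)
    V, k = 10000, 200                              # vocabulary size, reduced dim
    X, _ = np.linalg.qr(rng.normal(size=(V, k)))   # orthonormal columns

    def decompose(p):
        w = X.T @ p            # low-dimensional representation
        eps = p - X @ w        # residual, orthogonal to span(X)
        return w, eps

    p, q = rng.random(V), rng.random(V)
    wp, ep = decompose(p)
    wq, eq = decompose(q)
    assert np.isclose(p @ q, wp @ wq + ep @ eq)    # exact, by orthogonality

In the paper's setting, the residual is further truncated to its few largest entries (the "few words"), so the equality becomes an approximation.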

A Probabilistic Implementation

• x is a switch variable: it controls whether a word is generated from
• a topic-specific distribution,
• a document-specific distribution, or
• a background distribution:

p(w|d) = p(x=0|d) Σ_{k=1}^K p(w|z=k) p(z=k|d) + p(x=1|d) p(w|d, special) + p(x=2|d) p(w|background)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
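As a toy illustration of this three-way mixture (all numbers below are illustrative assumptions, not from the paper):

    import numpy as np

    rng = np.random.default_rng(1)
    V, K = 1000, 20
    p_w_z = rng.dirichlet([0.1] * V, size=K)    # topic-specific distributions
    p_z_d = rng.dirichlet([0.5] * K)            # this document's topic mixture
    p_w_special = rng.dirichlet([0.05] * V)     # document-specific distribution
    p_w_bg = np.full(V, 1.0 / V)                # background distribution
    p_x = np.array([0.6, 0.3, 0.1])             # p(x=0|d), p(x=1|d), p(x=2|d)

    p_w_d = (p_x[0] * (p_z_d @ p_w_z)           # topic route
             + p_x[1] * p_w_special             # document route
             + p_x[2] * p_w_bg)                 # background route
    assert np.isclose(p_w_d.sum(), 1.0)         # still a distribution over words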

Search (Online)

[Figure: online search pipeline – the query image is decomposed into a low-dimensional topic vector plus a few residual words; an LSH index over the topic vectors retrieves candidate documents (e.g., Doc 300, Doc 401, …), which are then re-ranked, using the residual words and document metadata, into the final list.]

Index: 10M images (4.6 GB). Search speed: < 100 ms.

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics
• Structure: who replies to whom

Optimize Them Together

• Model the semantics
• Model the structure

Reply Reconstruction

[Figure: reply relationships are reconstructed from a combination of document similarity, topic similarity, and structure similarity.]

Baselines:
• NP: reply to the nearest post
• RR: reply to the root
• DS: document similarity
• LDA: Latent Dirichlet Allocation – project documents to the topic space
• SWB: Special Words Topic Model with Background distribution – project documents to the topic and junk-topic spaces

Evaluation

    Method    Slashdot                 Apple
              All Posts  Good Posts    All Posts  Good Posts
    NP        0.021      0.012         0.289      0.239
    RR        0.183      0.319         0.269      0.474
    DS        0.463      0.643         0.409      0.628
    LDA       0.465      0.644         0.410      0.648
    SWB       0.463      0.644         0.410      0.641
    SMSS      0.524      0.737         0.517      0.772

Expert Finding

Pipeline: reply reconstruction → network construction → expert finding

Methods: HITS, PageRank, …

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06 – achieves stable performance on the expert-finding task using a language model.
• PageRank: benchmark node-ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06 – finds the most influential nodes.

Evaluation

• Bayesian estimate

    Method          MRR     MAP     P@10
    LM              0.821   0.698   0.800
    EABIF(ori)      0.674   0.362   0.243
    EABIF(rec)      0.742   0.318   0.281
    PageRank(ori)   0.675   0.377   0.263
    PageRank(rec)   0.743   0.321   0.266
    HITS(ori)       0.906   0.832   0.900
    HITS(rec)       0.938   0.822   0.906
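For reference, MRR (the first column) can be computed as below – a generic sketch with hypothetical inputs, not code from the paper:

    def mean_reciprocal_rank(ranked_lists, relevant_sets):
        # ranked_lists[i]: ranked candidate ids for query i
        # relevant_sets[i]: set of correct ids for query i
        total = 0.0
        for ranking, relevant in zip(ranked_lists, relevant_sets):
            rank = next((r + 1 for r, x in enumerate(ranking) if x in relevant), None)
            total += 1.0 / rank if rank else 0.0
        return total / len(ranked_lists)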

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision:
– Matrix decomposition is a good practice for learning matrices.
– Graphical models are a good practice for learning probability.

• A graphical model is a good tool for analyzing problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 51: An Introduction To Matrix Decomposition and Graphical Model

More General

bull Data set bull Likelihood

bull Goal learn maximum likelihood (ML) parameter values

bull The maximum likelihood procedure finds parameters θ such that

bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 52: An Introduction To Matrix Decomposition and Graphical Model

The Expectation Maximization (EM) Algorithm

bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps

bull E step Fill in values of latent variables according to posterior given data

bull M step Maximize likelihood as if latent variables were not hidden

bull Decomposes difficult problems into series of tractable steps

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA

• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus

• Dimension reduction:

          Term1  Term2  Term3  Term4  …  TermN
  Img1      1      2      0      0   …    2        (Dim = 1 million)

                 ↓  Topic Projection  ↓

           f1    f2    …    fM
  Img1    0.2   0.1    …   0.03                    (Dim = 200)

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

  $p = Xw + \varepsilon$

An image = [figure: a low-dimensional topic histogram] + a few words (10 words)

Orthogonal Decomposition

$p = Xw + \varepsilon, \qquad X = (x_1, x_2, \ldots, x_k)$

• Base vectors: the columns $x_1, \ldots, x_k$ of $X$
• Low-dimensional representation: $w$ (for orthonormal $X$, $w = X^T p$)
• Residual: $\varepsilon = p - Xw$


• Similarity is preserved: for documents $p$ and $q$ decomposed over the same basis $X_1, X_2, X_3, \ldots, X_k$,

  $p^T q = (Xw_p + \varepsilon_p)^T (Xw_q + \varepsilon_q) \approx w_p^T w_q + \varepsilon_p^T \varepsilon_q$

  assuming the basis is orthonormal and the residuals lie outside its span.
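
A minimal sketch of this decomposition and the preserved similarity, assuming X has orthonormal columns; the 10-word residual cutoff mirrors the slides, everything else is illustrative:

```python
import numpy as np

def decompose(p, X, n_residual_words=10):
    """Split a TF-IDF vector p into p = Xw + eps; keep only the
    largest-magnitude residual entries ('+ a few words')."""
    w = X.T @ p                              # low-dimensional representation
    eps = p - X @ w                          # residual error in vocabulary space
    keep = np.argsort(-np.abs(eps))[:n_residual_words]
    sparse_eps = np.zeros_like(eps)
    sparse_eps[keep] = eps[keep]
    return w, sparse_eps

def similarity(w_p, eps_p, w_q, eps_q):
    """p^T q is approximately w_p^T w_q + eps_p^T eps_q under the
    orthogonality assumptions above."""
    return w_p @ w_q + eps_p @ eps_q
```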

A Probabilistic Implementation

x is a switch variable: it controls whether a word is generated from

• a topic-specific distribution
• a document-specific distribution
• a background distribution

$p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) \;+\; p(x{=}1 \mid d)\, p'(w \mid d) \;+\; p(x{=}2 \mid d)\, p(w \mid \Omega)$

where $p'(w \mid d)$ is the document-specific distribution and $p(w \mid \Omega)$ the background distribution.

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
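
A small sketch of evaluating this mixture, assuming the component distributions are already given as arrays (the names and shapes here are assumptions for illustration):

```python
import numpy as np

def word_likelihood(w, d, p_x, topic_word, doc_topic, doc_word, background):
    """p(w|d) as the three-way switch mixture above.
    p_x: (D, 3) switch probabilities; topic_word: (K, V);
    doc_topic: (D, K); doc_word: (D, V); background: (V,)."""
    topic_part = p_x[d, 0] * (topic_word[:, w] @ doc_topic[d])  # x = 0: topic-specific
    doc_part   = p_x[d, 1] * doc_word[d, w]                     # x = 1: document-specific
    bg_part    = p_x[d, 2] * background[w]                      # x = 2: background
    return topic_part + doc_part + bg_part
```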

Search (Online)

[Figure: offline, each document is reduced to a low-dimensional signature (DS1, DS2, …) and bucketed in an LSH index; a lookup returns a candidate list such as Doc 300, Doc 401, …]

A query = [figure: a low-dimensional topic histogram] + a few words

[Figure: the candidate list (Doc 1 … Doc N) is re-ranked using the preserved residuals, e.g. Doc 300 and Doc 401 rise to the top; document metadata is then fetched for the final results.]

Index: 10M images, 46 GB. Search speed: < 100 ms.
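
A hedged sketch of the online stage: the slides only say "LSH index", so the random-hyperplane hash family, bucket layout, and candidate cutoff below are assumptions:

```python
import numpy as np

def lsh_key(w, hyperplanes):
    """One sign bit per random hyperplane."""
    return ((hyperplanes @ w) > 0).tobytes()

def build_index(docs, hyperplanes):
    """Offline: bucket each document signature by its LSH key.
    docs maps doc_id -> (w, sparse_eps)."""
    index = {}
    for doc_id, (w, _) in docs.items():
        index.setdefault(lsh_key(w, hyperplanes), []).append(doc_id)
    return index

def search(query_w, query_eps, docs, index, hyperplanes, topn=300):
    """Online: fetch the query's bucket, then re-rank candidates by the
    preserved similarity w^T w' + eps^T eps'."""
    candidates = index.get(lsh_key(query_w, hyperplanes), [])
    scored = [(doc_id, query_w @ docs[doc_id][0] + query_eps @ docs[doc_id][1])
              for doc_id in candidates]
    return sorted(scored, key=lambda t: -t[1])[:topn]
```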

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009


Semantic & structure


• Semantic: topics
• Structure: who replies to whom

Optimize them together

• Model semantics
• Model structure


Reply reconstruction


Candidate replies are scored by combining three measures: document similarity, topic similarity, and structure similarity (a sketch of such a combination follows the baselines below).

Baselines:
• NP: reply to the nearest post
• RR: reply to the root
• DS: document similarity
• LDA: Latent Dirichlet Allocation; projects documents to the topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to the topic space plus a junk-topic space
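
A sketch of one way to combine the three similarities for reply reconstruction; the linear combination, weights, and post fields are illustrative assumptions, not the paper's exact sparse-coding formulation:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def reply_score(post, candidate, lam=(0.4, 0.4, 0.2)):
    """Score a candidate parent by document, topic, and structure similarity."""
    return (lam[0] * cosine(post["tfidf"], candidate["tfidf"])   # document similarity
          + lam[1] * cosine(post["topic"], candidate["topic"])   # topic similarity
          + lam[2] * candidate["struct"])                        # structure prior, e.g. thread distance

def reconstruct_parent(post, earlier_posts):
    """Predict the replied-to post as the highest-scoring earlier post."""
    return max(earlier_posts, key=lambda c: reply_score(post, c))
```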


Evaluation

Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772


Expert finding

Pipeline: reply reconstruction → network construction → expert finding

Methods: HITS, PageRank, … (a HITS sketch follows below)
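
For concreteness, a small HITS power-iteration sketch over the reconstructed reply network; the asker → answerer edge convention (so that experts surface as authorities) is an assumption:

```python
import numpy as np

def hits(adj, n_iter=100):
    """HITS on the reply network: adj[i, j] = 1 if user j replied to user i.
    Returns authority scores (candidate experts) and hub scores."""
    hubs = np.ones(adj.shape[0])
    auth = np.ones(adj.shape[0])
    for _ in range(n_iter):
        auth = adj.T @ hubs                 # good answerers are pointed to by many askers
        auth /= np.linalg.norm(auth)
        hubs = adj @ auth                   # good askers attract good answerers
        hubs /= np.linalg.norm(hubs)
    return auth, hubs
```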


Baselines

• LM: "Formal Models for Expert Finding in Enterprise Corpora", SIGIR '06 – achieves stable performance in the expert-finding task using a language model
• PageRank: benchmark node-ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: "Personalized Recommendation Driven by Information Flow", SIGIR '06 – finds the most influential nodes

Evaluation


• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition – a good practice for learning matrices
  – Graphical models – a good practice for learning probability

• A graphical model is a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

Page 53: An Introduction To Matrix Decomposition and Graphical Model

Jensenrsquos Inequality

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 54: An Introduction To Matrix Decomposition and Graphical Model

Lower Bounding the Log Likelihoodbull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ

bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality

bull where H[q] is the entropy of q(X)

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 55: An Introduction To Matrix Decomposition and Graphical Model

The E and M Steps of EM

bull The lower bound on the log likelihood is given by

bull EM alternates betweenbull E step optimize wrt distribution over hidden variables

holding params fixed

bull M step maximize wrt parameters holding hidden distribution fixed

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

• E-step (variational inference): repeat until convergence

$$\phi_{ni} \propto \beta_{i w_n} \exp\!\big(\Psi(\gamma_i)\big), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$$

• M-step (parameter estimation):

$$\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$$

The update for α has no closed form and can be implemented using the Newton-Raphson method. (A code sketch of the E-step follows below.)
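As a concrete illustration, a minimal per-document sketch of the E-step updates above (assuming NumPy/SciPy and the standard Blei-style notation; `beta` is a K×V topic–word matrix and `doc` a list of word indices):

```python
import numpy as np
from scipy.special import digamma

def variational_e_step(doc, alpha, beta, iters=50):
    """Coordinate ascent for one document: returns phi (N x K) and gamma (K)."""
    K, N = beta.shape[0], len(doc)
    gamma = alpha + N / K                     # variational Dirichlet parameters
    phi = np.full((N, K), 1.0 / K)            # variational topic assignments
    for _ in range(iters):
        # phi_{ni} proportional to beta_{i,w_n} * exp(digamma(gamma_i))
        phi = beta[:, doc].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_{ni}
        gamma = alpha + phi.sum(axis=0)
    return phi, gamma

# toy usage
alpha = np.full(3, 0.1)
beta = np.random.default_rng(0).dirichlet(np.ones(20), size=3)
phi, gamma = variational_e_step([4, 7, 7, 12], alpha, beta)
print(gamma.round(3))
```

The M-step would then re-estimate β by accumulating φ over all documents, as in the update above.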

Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset – contains 8,000 documents and 15,818 words

[Figure: classification accuracy – (a) EARN vs. NOT EARN, (b) GRAIN vs. NOT GRAIN]

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

[Figure: the LDA graphical model repeated for several documents – topics z1–z4 generating words w1–w4, with a shared word distribution β]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information

[Figure: plate notation – for each of M documents, mixture proportions π over topics z generate patches x, with parameters θ and β; N_d patches per document]

Codebook

• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector, or Lowe's DoG detector
• Representation: normalized 11×11 gray values, or 128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA

• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus

• Dimension reduction:

         Term1  Term2  Term3  Term4  …  TermN
  Img1     1      2      0      0    …    2        (Dim = 1 million)

                  ↓  Topic Projection

          f1    f2   …    fM
  Img1    0.2   0.1  …   0.03                      (Dim = 200)

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

p = Xw + ε

An image = [figure: bar chart of its low-dimensional feature vector w] + a few words (10 words)

Orthogonal Decomposition

With an orthonormal basis X = (x_1, x_2, …, x_k) (the base vectors), the low-dimensional representation is w = X^T p and the residual is ε = p − Xw, so that

$$p = Xw + \varepsilon$$

An image = [figure: bar chart of its low-dimensional feature vector w] + a few words (10 words)

For an orthonormal basis X = (X_1, X_2, X_3, …, X_k), inner products decompose as

$$p^{T} q = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q$$

so similarity can be computed from the low-dimensional vectors plus the residuals.
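A small sketch of this decomposition and the inner-product identity (assuming NumPy; a random orthonormal basis stands in for a learned topic projection):

```python
import numpy as np

rng = np.random.default_rng(2)
V, k = 1000, 20                        # vocabulary size, reduced dimension

# Orthonormal basis X (V x k), e.g., obtained from an SVD of the corpus.
X, _ = np.linalg.qr(rng.standard_normal((V, k)))

def decompose(p):
    w = X.T @ p                        # low-dimensional representation
    eps = p - X @ w                    # residual error, orthogonal to span(X)
    return w, eps

p, q = rng.standard_normal(V), rng.standard_normal(V)
wp, ep = decompose(p)
wq, eq = decompose(q)

# Inner products decompose exactly: p.q = wp.wq + ep.eq
print(np.allclose(p @ q, wp @ wq + ep @ eq))   # True
```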

A Probabilistic Implementation

$$p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) \;+\; p(x{=}1 \mid d)\, p_{\text{doc}}(w \mid d) \;+\; p(x{=}2 \mid d)\, p_{\text{bg}}(w)$$

x is a switch variable; it controls whether a word is generated from:
• a topic-specific distribution (x = 0)
• a document-specific distribution (x = 1)
• a background distribution (x = 2)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
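A toy sketch of this switch-variable mixture (assuming NumPy; all distributions here are random stand-ins for learned ones):

```python
import numpy as np

rng = np.random.default_rng(3)
K, V = 4, 50
topic_word = rng.dirichlet(np.ones(V), size=K)   # p(w | z=k), shared topics
doc_topic = rng.dirichlet(np.ones(K))            # p(z | d)
doc_word = rng.dirichlet(np.ones(V))             # document-specific p(w | d)
background = rng.dirichlet(np.ones(V))           # corpus-wide background p(w)
switch = np.array([0.6, 0.3, 0.1])               # p(x = 0, 1, 2 | d)

def word_prob(w):
    """p(w|d) as a mixture of topic, document-specific, and background parts."""
    return (switch[0] * (doc_topic @ topic_word[:, w])
            + switch[1] * doc_word[w]
            + switch[2] * background[w])

print(sum(word_prob(w) for w in range(V)))   # 1.0: a proper distribution
```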

Search (Online)

[Figure: online search pipeline – the query is converted to a low-dimensional feature vector plus a few residual words, hashed into an LSH index over document signatures (DS1, DS2, …), and the retrieved candidates (e.g., Doc 300, Doc 401) are re-ranked against the document metadata]

Index: 10M images, 4.6 GB. Search speed: < 100 ms.
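For intuition, a minimal random-hyperplane LSH sketch over the low-dimensional vectors (assuming NumPy; the bucket layout and parameters are illustrative, not the paper's actual index):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(4)
k, n_bits, n_docs = 200, 16, 10_000

planes = rng.standard_normal((n_bits, k))    # random hyperplanes

def signature(w):
    """16-bit LSH signature: the sign pattern of projections onto the planes."""
    bits = (planes @ w) > 0
    return int(np.packbits(bits).view(np.uint16)[0])

# Offline: hash every document's low-dimensional vector into buckets.
docs = rng.standard_normal((n_docs, k))
index = defaultdict(list)
for i, w in enumerate(docs):
    index[signature(w)].append(i)

# Online: look up the query's bucket, then re-rank candidates exactly.
query = rng.standard_normal(k)
candidates = index[signature(query)]
ranked = sorted(candidates, key=lambda i: -(docs[i] @ query))
print(len(candidates), ranked[:10])
```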

Search Example

[Figure: query image and retrieved results]

Search Example

[Figure: query image and retrieved results]

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics
• Structure: who replies to whom

• Optimize them together: model the semantics and the structure jointly.

Reply Reconstruction

• Combine document similarity, topic similarity, and structure similarity (a toy scoring sketch follows below).

Baselines:
• NP: reply to the nearest post
• RR: reply to the root
• DS: document similarity
• LDA: Latent Dirichlet Allocation; project documents to topic space
• SWB: Special Words Topic Model with Background distribution; project documents to topic and junk-topic space
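As a rough illustration of blending the three signals, a hypothetical linear scoring sketch (the weights, features, and function are all assumptions for illustration; the paper's SMSS model is a sparse-coding formulation, not this blend):

```python
import numpy as np

def reply_score(cand_vec, post_vec, cand_topics, post_topics,
                structure_prior, weights=(0.4, 0.4, 0.2)):
    """Score a candidate parent for reply reconstruction by combining
    document similarity, topic similarity, and a structural prior."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    doc_sim = cos(cand_vec, post_vec)          # e.g., TF-IDF cosine
    topic_sim = cos(cand_topics, post_topics)  # e.g., cosine of p(z|d)
    w1, w2, w3 = weights
    return w1 * doc_sim + w2 * topic_sim + w3 * structure_prior
```

The candidate with the highest score would be chosen as the reconstructed reply target.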

Evaluation (reply reconstruction accuracy)

  Method   Slashdot               Apple
           All Posts  Good Posts  All Posts  Good Posts
  NP       0.021      0.012       0.289      0.239
  RR       0.183      0.319       0.269      0.474
  DS       0.463      0.643       0.409      0.628
  LDA      0.465      0.644       0.410      0.648
  SWB      0.463      0.644       0.410      0.641
  SMSS     0.524      0.737       0.517      0.772

Expert Finding

Pipeline: reply reconstruction → network construction → expert finding

Methods: HITS, PageRank, … (a HITS-style sketch follows below)
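A minimal HITS-style sketch for ranking users in a reconstructed reply network (assuming NumPy; the edge convention, from asker to answerer, is an illustrative assumption, under which authorities correspond to users who answer many questions):

```python
import numpy as np

def hits(adj, iters=100):
    """Power iteration for hub/authority scores on a user reply graph.

    adj[i, j] = 1 if a post by user i was answered by user j; under this
    convention, high-authority users answer many questions (candidate experts).
    """
    n = adj.shape[0]
    hub, auth = np.ones(n), np.ones(n)
    for _ in range(iters):
        auth = adj.T @ hub
        auth /= np.linalg.norm(auth)
        hub = adj @ auth
        hub /= np.linalg.norm(hub)
    return hub, auth

adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 0],
                [1, 0, 1, 0]], dtype=float)
hub, auth = hits(adj)
print(auth.argsort()[::-1])   # users ranked by authority (candidate experts)
```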

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance on the expert-finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential node

Evaluation

• Bayesian estimate

  Method          MRR     MAP     P@10
  LM              0.821   0.698   0.800
  EABIF(ori)      0.674   0.362   0.243
  EABIF(rec)      0.742   0.318   0.281
  PageRank(ori)   0.675   0.377   0.263
  PageRank(rec)   0.743   0.321   0.266
  HITS(ori)       0.906   0.832   0.900
  HITS(rec)       0.938   0.822   0.906

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision:
  – Matrix decomposition – a good practice for learning matrices
  – Graphical models – a good practice for learning probability

• A graphical model is a good tool for analyzing and understanding problems.

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 56: An Introduction To Matrix Decomposition and Graphical Model

The E Step

bull E step for fixed θ

bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves

that bound whenbull So the E step simply sets

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 57: An Introduction To Matrix Decomposition and Graphical Model

The M Step

bull M step maximize wrt parameters holding hidden distribution q fixed

bull The second equality comes from fact that entropy of q(X) does not depend directly on θ

bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 58: An Introduction To Matrix Decomposition and Graphical Model

EM Never Decreases the Likelihood

bull The E and M steps together never decrease the log likelihood

bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-

negativity of KL

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

• The Dirichlet distribution helps avoid over-fitting, but the assumption might be too strong.

[Graphical model: as before, per-document topic chains z1…z4 generating words w1…w4, with θ per document and shared β.]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information.

[Plate diagram (M images, Nd patches each): mixing proportions π generate topics z, which generate image patches x, with parameters θ and β.]

Codebook

• 174 local image patches.
• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector.
• Representation: normalized 11×11 gray values; 128-dim SIFT.

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR, 2005.

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.

• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA

• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1,000 keywords:
– Need to access 1,000 inverted lists.
– The intersection of 1,000 inverted lists may be empty.
– The union of 1,000 inverted lists may be the whole corpus.

• Dimension reduction:

Term vector (dim = 1 million): Term1, Term2, Term3, Term4, …, TermN
Img1: 1, 2, 0, 0, …, 2

→ Topic projection →

Topic vector (dim = 200): f1, f2, …, fM
Img1: 0.2, 0.1, …, 0.03

Key Idea Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• e: residual error

$$ p \;=\; Xw + e $$

[Figure: an image = a dense low-dimensional topic vector + a few residual words (~10 words).]

Orthogonal Decomposition

With an orthonormal basis X = (x1, x2, …, xk) (so XᵀX = I), the decomposition p = Xw + e gives:

$$ w = X^{\mathsf T} p, \qquad e = p - X X^{\mathsf T} p $$

• Base vectors: x1, …, xk
• Low-dimensional representation: w
• Residual: e


Similarity is preserved by the decomposition: for documents p and q,

$$ p^{\mathsf T} q \;=\; w_p^{\mathsf T} w_q + e_p^{\mathsf T} e_q $$

since the residuals are orthogonal to the span of X1, X2, X3, …, Xk.
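A small numpy sketch of this decomposition and the preserved inner product, assuming a column-orthonormal X (toy dimensions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 20                                # vocabulary dim, reduced dim (toy)
X, _ = np.linalg.qr(rng.normal(size=(V, k)))   # orthonormal basis: X^T X = I

def decompose(p):
    w = X.T @ p                                # low-dimensional representation
    e = p - X @ w                              # residual, orthogonal to span(X)
    return w, e

p, q = rng.normal(size=V), rng.normal(size=V)
wp, ep = decompose(p)
wq, eq = decompose(q)

# p^T q = w_p^T w_q + e_p^T e_q holds exactly when X is orthonormal
print(np.allclose(p @ q, wp @ wq + ep @ eq))   # True
```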

A Probabilistic Implementation

x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution,
• a document-specific distribution,
• a background distribution.

$$ p(w \mid d) \;=\; p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) \;+\; p(x{=}1 \mid d)\, p(w \mid d_{\text{specific}}) \;+\; p(x{=}2 \mid d)\, p(w \mid \text{background}) $$

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS, 2006.
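As a sketch, the switch-variable mixture can be evaluated directly; all distributions below are hypothetical toy values:

```python
import numpy as np

V, K = 8, 3
rng = np.random.default_rng(2)
p_x = np.array([0.6, 0.3, 0.1])             # p(x=0|d), p(x=1|d), p(x=2|d)
p_w_given_z = rng.dirichlet(np.ones(V), K)  # K topic-specific distributions
p_z_given_d = rng.dirichlet(np.ones(K))     # topic proportions for document d
p_w_doc = rng.dirichlet(np.ones(V))         # document-specific distribution
p_w_bg = np.full(V, 1.0 / V)                # background distribution

p_w_given_d = (p_x[0] * p_z_given_d @ p_w_given_z   # topic route
               + p_x[1] * p_w_doc                   # document route
               + p_x[2] * p_w_bg)                   # background route
print(p_w_given_d.sum())                    # sums to 1.0, a valid distribution
```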

Search (Online)

[Figure: LSH index: hash tables of document signatures (DS1, DS2, …); a bucket lookup returns candidate documents such as Doc 300, Doc 401, …]

[Figure: a query = a dense low-dimensional topic vector + a few residual words.]

[Figure: candidate documents (e.g., Doc 300, Doc 401) from the LSH lookup are re-ranked over the full document list using stored document metadata.]

Index: 10M images, 46 GB. Search speed: < 100 ms.
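A minimal random-hyperplane LSH sketch of this pipeline (the bucket layout and re-ranking score are illustrative assumptions, not the paper's implementation):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
k, bits, n_docs = 20, 16, 10_000
H = rng.normal(size=(bits, k))                 # random hyperplanes

def signature(w):
    return tuple((H @ w > 0).astype(int))      # 16-bit LSH signature

docs_w = rng.normal(size=(n_docs, k))          # low-dim doc vectors (toy data)
index = defaultdict(list)
for i, w in enumerate(docs_w):                 # offline: hash every document
    index[signature(w)].append(i)

q = rng.normal(size=k)
candidates = index[signature(q)]               # online: one bucket lookup
# re-rank candidates by w_p . w_q (the residual term e_p . e_q would be added here)
ranked = sorted(candidates, key=lambda i: -docs_w[i] @ q)
print(ranked[:10])
```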

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics.
• Structure: who replies to whom.
• Optimize them together: model semantics and model structure jointly.

Reply reconstruction

• Document similarity
• Topic similarity
• Structure similarity

Baselines:

• NP – reply to the nearest post.
• RR – reply to the root.
• DS – document similarity.
• LDA – Latent Dirichlet Allocation; project documents to topic space.
• SWB – Special Words Topic Model with Background distribution; project documents to topic and junk-topic space.

Evaluation

Method | Slashdot All Posts | Slashdot Good Posts | Apple All Posts | Apple Good Posts
NP | 0.021 | 0.012 | 0.289 | 0.239
RR | 0.183 | 0.319 | 0.269 | 0.474
DS | 0.463 | 0.643 | 0.409 | 0.628
LDA | 0.465 | 0.644 | 0.410 | 0.648
SWB | 0.463 | 0.644 | 0.410 | 0.641
SMSS | 0.524 | 0.737 | 0.517 | 0.772

Expert Finding

Pipeline: reply reconstruction → network construction → expert finding.

Methods: HITS, PageRank, …

Baselines:

• LM – Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance on the expert-finding task using a language model.
• PageRank – benchmark nodal ranking method.
• HITS – finds hub nodes and authority nodes.
• EABIF – Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential nodes.

Evaluation

• Bayesian estimate.

Method | MRR | MAP | P@10
LM | 0.821 | 0.698 | 0.800
EABIF(ori) | 0.674 | 0.362 | 0.243
EABIF(rec) | 0.742 | 0.318 | 0.281
PageRank(ori) | 0.675 | 0.377 | 0.263
PageRank(rec) | 0.743 | 0.321 | 0.266
HITS(ori) | 0.906 | 0.832 | 0.900
HITS(rec) | 0.938 | 0.822 | 0.906

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision.
– Matrix decomposition – a good practice for learning matrices.
– Graphical models – a good practice for learning probability.

• A graphical model is a good tool for analyzing problems.

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 59: An Introduction To Matrix Decomposition and Graphical Model

Reference

bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)

bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 60: An Introduction To Matrix Decomposition and Graphical Model

WHY DO WE NEED GRAPHICAL MODEL

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 61: An Introduction To Matrix Decomposition and Graphical Model

Why Do We Need Graphical Models

bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions

bull Prosndash We do need probability to explain our world But joint probability is

hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the

relationships between many variablesndash With a graphical model we can decouple joint probability to

conditional probabilities which are usually easier

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

  p ≈ Xw, i.e. p = Xw + ε
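With the basis fixed, w and ε follow directly from least squares; for an orthonormal X (see the orthogonal decomposition below), X^T X = I and the solution simplifies:

```latex
w = \arg\min_{w'} \|p - X w'\|^2 = (X^T X)^{-1} X^T p = X^T p,
\qquad
\varepsilon = p - Xw
```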

[Figure: an image shown as a bar chart over low-dimensional features] An image = a low-dimensional vector + a few words (~10 words).

Orthogonal Decomposition

Given an orthonormal basis X = (x₁, …, x_k), i.e. x_iᵀx_j = 1 if i = j and 0 otherwise:

  w = Xᵀp                      (low-dimensional representation)
  ε = p − Xw = (I − XXᵀ)p      (residual)
  p = Xw + ε                   (base vectors X, coordinates w, residual ε)


Since Xᵀε = 0, inner products decompose exactly: for p = Xw_p + ε_p and q = Xw_q + ε_q,

  pᵀq = w_pᵀw_q + ε_pᵀε_q
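A minimal numeric sketch of this decomposition, assuming an orthonormal X (here obtained from a QR factorization; the sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 10,000-term vocabulary projected to 200 dims.
n_terms, k = 10_000, 200

# Orthonormal projection basis X (columns x_1 ... x_k) via QR.
X, _ = np.linalg.qr(rng.standard_normal((n_terms, k)))

def decompose(p):
    """Split a TF-IDF vector into low-dim coordinates and a residual."""
    w = X.T @ p          # low-dimensional representation
    eps = p - X @ w      # residual error, orthogonal to the basis
    return w, eps

p = rng.random(n_terms)
q = rng.random(n_terms)
wp, ep = decompose(p)
wq, eq = decompose(q)

# The exact inner product is recovered from the two parts: p.q = wp.wq + ep.eq
assert np.isclose(p @ q, wp @ wq + ep @ eq)
```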

A Probabilistic Implementation

x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution

• a document-specific distribution

• a background distribution

  p(w | d) = p(x=0 | d) · Σ_{k=1..K} p(w | z=k) p(z=k | d)
           + p(x=1 | d) · p_doc(w | d)
           + p(x=2 | d) · p_bg(w)

where p_doc is the document-specific distribution and p_bg the background distribution.

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
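A toy generative sketch of the switch-variable model; all distribution values below are made up for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative toy distributions for one document d.
p_switch = [0.7, 0.2, 0.1]               # p(x=0|d), p(x=1|d), p(x=2|d)
p_z_given_d = np.array([0.5, 0.3, 0.2])  # K = 3 topics
vocab = ["matrix", "topic", "image", "the", "pixel"]
p_w_given_z = np.array([                 # topic-specific distributions
    [0.4, 0.3, 0.1, 0.1, 0.1],
    [0.1, 0.5, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.1, 0.3],
])
p_w_doc = np.array([0.1, 0.1, 0.3, 0.1, 0.4])   # document-specific
p_w_bg  = np.array([0.1, 0.1, 0.1, 0.6, 0.1])   # background (stopword-ish)

def sample_word():
    x = rng.choice(3, p=p_switch)        # switch variable
    if x == 0:                           # topic route
        z = rng.choice(3, p=p_z_given_d)
        return rng.choice(vocab, p=p_w_given_z[z])
    if x == 1:                           # document-specific route
        return rng.choice(vocab, p=p_w_doc)
    return rng.choice(vocab, p=p_w_bg)   # background route

print([sample_word() for _ in range(10)])
```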

Search (Online)

[Figure: online search pipeline] A query is decomposed into a low-dimensional vector plus a few residual words. The LSH index over the low-dimensional vectors returns a candidate set (e.g., Doc 300, Doc 401, …) from the document store; the candidates are then re-ranked using the residual and the document metadata to produce the final list (a sketch follows below).

Index: 10M images, 46 GB. Search speed: < 100 ms.
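A minimal sketch of the candidate-then-re-rank flow, using random-hyperplane LSH over the low-dimensional vectors; the bucket structure and parameters are illustrative, not the paper's exact scheme:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)
k, n_docs, n_bits = 200, 10_000, 8

W = rng.standard_normal((n_docs, k))            # low-dim vectors w of all docs
hyperplanes = rng.standard_normal((n_bits, k))  # random projections for LSH

def lsh_key(w):
    """Random-hyperplane signature: one sign bit per hyperplane."""
    return ((hyperplanes @ w) > 0).tobytes()

# Offline: hash every document's low-dimensional vector into a bucket.
index = defaultdict(list)
for doc_id, w in enumerate(W):
    index[lsh_key(w)].append(doc_id)

def search(w_query, rerank_score):
    """Online: fetch the query's bucket, then re-rank the candidates."""
    candidates = index[lsh_key(w_query)]
    return sorted(candidates, key=rerank_score, reverse=True)

# Toy query, re-ranked here by cosine similarity of w alone
# (the real system would also use the preserved residual words).
wq = rng.standard_normal(k)
cos = lambda d: (W[d] @ wq) / (np.linalg.norm(W[d]) * np.linalg.norm(wq))
print(search(wq, cos)[:10])
```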

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009


Semantic & structure

• Semantic: topics
• Structure: who replies to whom
• Optimize them together: jointly model the semantics and the structure

Reply reconstruction

• Candidate reply links are scored by combining document similarity, topic similarity, and structure similarity (a toy scoring sketch follows the baseline list below).

Baselines:
• NP – reply to nearest post
• RR – reply to root
• DS – document similarity
• LDA – Latent Dirichlet Allocation: project documents to topic space
• SWB – Special Words Topic Model with Background distribution: project documents to topic and junk-topic space
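A toy sketch of reply reconstruction as described above: each earlier post is scored by a weighted combination of the three similarities, and the best-scoring post is predicted as the parent. The weights and similarity functions here are illustrative, not the paper's sparse-coding formulation:

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def predict_parents(posts, doc_vecs, topic_vecs, weights=(0.4, 0.4, 0.2)):
    """Predict a parent for each post in thread order; posts[0] is the root."""
    wd, wt, ws = weights
    parents = {posts[0]: None}
    for i in range(1, len(posts)):
        scores = []
        for j in range(i):                     # only earlier posts can be parents
            doc_sim = cosine(doc_vecs[posts[i]], doc_vecs[posts[j]])
            topic_sim = cosine(topic_vecs[posts[i]], topic_vecs[posts[j]])
            struct_sim = 1.0 / (i - j)         # prefer nearby posts in the thread
            scores.append(wd * doc_sim + wt * topic_sim + ws * struct_sim)
        parents[posts[i]] = posts[int(np.argmax(scores))]
    return parents

# Toy thread: 4 posts with random term and topic vectors.
rng = np.random.default_rng(3)
print(predict_parents([0, 1, 2, 3], rng.random((4, 50)), rng.random((4, 5))))
```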


Evaluation

Method   Slashdot               Apple
         All Posts  Good Posts  All Posts  Good Posts
NP       0.021      0.012       0.289      0.239
RR       0.183      0.319       0.269      0.474
DS       0.463      0.643       0.409      0.628
LDA      0.465      0.644       0.410      0.648
SWB      0.463      0.644       0.410      0.641
SMSS     0.524      0.737       0.517      0.772


Expert finding

Pipeline: reply reconstruction → network construction → expert finding.

Methods: HITS, PageRank, …

Baselines

• LM – Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06: achieves stable performance in the expert-finding task using a language model
• PageRank – benchmark nodal ranking method
• HITS – finds hub nodes and authority nodes
• EABIF – Personalized Recommendation Driven by Information Flow, SIGIR '06: finds the most influential nodes

Evaluation


• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
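HITS, the strongest method in the table, is simple to state; below is a minimal power-iteration sketch on a reconstructed reply network (the adjacency matrix and iteration count are illustrative):

```python
import numpy as np

def hits(adj, n_iter=50):
    """Hub/authority scores by power iteration on adjacency matrix adj,
    where adj[i, j] = 1 if user i replies to user j."""
    hubs = np.ones(adj.shape[0])
    for _ in range(n_iter):
        auth = adj.T @ hubs          # good authorities are replied to by good hubs
        auth /= np.linalg.norm(auth)
        hubs = adj @ auth            # good hubs reply to good authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, auth

# Toy reply network among 4 users: most replies point at user 3.
adj = np.array([[0, 0, 0, 1],
                [0, 0, 1, 1],
                [1, 0, 0, 1],
                [0, 0, 0, 0]], dtype=float)
hubs, auth = hits(adj)
print("experts (ranked by authority):", np.argsort(-auth))
```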

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition – a good practice for learning matrices
  – Graphical models – a good practice for learning probability

• The graphical model is a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images

• The graphical model is more adaptable to various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 62: An Introduction To Matrix Decomposition and Graphical Model

Directed Acyclic Graphical Models (Bayesian Networks)

bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution

p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)

bull In general

bull where pa(i) are the parents of node i

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 63: An Introduction To Matrix Decomposition and Graphical Model

Directed Graphs for Statistical ModelsPlate Notation

bull A data set of N points generated from a Gaussian

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 64: An Introduction To Matrix Decomposition and Graphical Model

PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review

bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles

bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)

bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms

bull Disadvantagesndash Statistical foundation is missing

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

Page 66: An Introduction To Matrix Decomposition and Graphical Model

pLSA ndash Probabilistic Latent Semantic Analysis

bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation

Maximization (EM) Algorithmbull Shown to solve

ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo

ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same

bull Has a better statistical foundation than LSA

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 67: An Introduction To Matrix Decomposition and Graphical Model

pLSA

M

Nd

d

z

w

M

d

z1

w1

z2

w2

z3

w3

zN

wN

hellip

z1 hellip zN are variables ziє[1K]K is the number of latent topics

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 68: An Introduction To Matrix Decomposition and Graphical Model

pLSA

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dM

z1

w1

z2

w2

zNm

wNm

hellip

p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents

Likelihood

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

• In the E-step, we need to compute the posterior distribution of the hidden variables.
• Unfortunately, this distribution is intractable to compute in general.
• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters (γ, φ) and minimize the KL divergence between the variational and posterior distributions.

Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence.
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs EM

• They differ only in the E-step.
• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, p(X|D, θ) is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ)).
• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.
• Strategy (variational EM):
• Lower bound log p(w | α, β) by a function L(γ, φ; α, β), and repeat until convergence:
– E: maximize L with respect to the variational parameters γ and φ.
– M: maximize the bound with respect to the model parameters α and β.

Parameter Estimation

• E-step: variational inference, i.e. repeat the γ and φ updates until convergence.
• M-step: parameter estimation:
– β is re-estimated from the variational posteriors of all documents.
– α can be updated using the Newton-Raphson method.
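A minimal sketch of the variational E-step for one document, following Blei et al.'s update equations φnk ∝ βk,wn · exp(Ψ(γk)) and γ = α + Σn φn (numpy/scipy assumed; the M-step and the Newton-Raphson update for α are only indicated in comments, and the demo data is random):

```python
# Variational E-step updates for LDA on a single document.
import numpy as np
from scipy.special import digamma

def lda_e_step(word_ids, alpha, beta, n_iter=50):
    """word_ids: length-N vocabulary indices; alpha: (K,); beta: (K, V)."""
    K, N = len(alpha), len(word_ids)
    phi = np.full((N, K), 1.0 / K)           # q(z_n = k)
    gamma = alpha + N / K                     # variational Dirichlet parameter
    for _ in range(n_iter):
        # phi_{nk} proportional to beta_{k, w_n} * exp(digamma(gamma_k))
        phi = beta[:, word_ids].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma = alpha + sum_n phi_n
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

# M-step over all documents: beta_{k,v} proportional to
# sum_d sum_n phi_{dnk} * [w_{dn} = v]; alpha via Newton-Raphson (omitted).
rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(30), size=5)     # 5 topics, 30-word vocabulary
gamma, phi = lda_e_step(rng.integers(0, 30, 12), np.full(5, 0.1), beta)
print(gamma)
```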

Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
[Figure: top words of several learned topics]

Classification (50-topic LDA + SVM)
• Reuters-21578 dataset – contains 8,000 documents and 15,818 words
[Figure: classification accuracy for (a) EARN vs. NOT EARN and (b) GRAIN vs. NOT GRAIN]

Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.
[Figure: LDA graphical model repeated; topics z1…z4 generating words w1…w4 with shared topic-word parameters β]

A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate-notation graphical model with per-image mixing proportions π, topics z, patches x, and parameters θ, β, over N patches and M images]

Codebook
• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector
• Representation: normalized 11×11 gray values, or 128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem
• If a query contains 1000 keywords:
– Need to access 1000 inverted lists.
– The intersection of 1000 inverted lists may be empty.
– The union of 1000 inverted lists may be the whole corpus.
• Dimension reduction:

       Term1  Term2  Term3  Term4  …  TermN
Img1   1      2      0      0      …  2          (Dim = 1 million)

            ↓ Topic Projection

       f1    f2    …  fM
Img1   0.2   0.1   …  0.03                       (Dim = 200)

Key Idea: Dimension Reduction + Residual Error Preservation

$$p \approx Xw, \qquad p = Xw + \epsilon$$

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

[Figure: an image ≈ a dense low-dimensional topic vector + a few residual words (10 words)]

Orthogonal Decomposition

With an orthonormal basis X = (x1, x2, …, xk):

$$p = Xw + \epsilon = w_1 x_1 + w_2 x_2 + \cdots + w_k x_k + \epsilon, \qquad w = X^T p$$

– x1 … xk: base vectors
– w: low-dimensional representation
– ε: residual


Because the residuals are orthogonal to the span of X = (x1, x2, …, xk), inner products are preserved:

$$p^T q = w_p^T w_q + \epsilon_p^T \epsilon_q$$
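A small numpy check of this identity, assuming a random orthonormal basis X: the inner product splits exactly into a low-dimensional part plus a residual part, and keeping only the few largest residual entries corresponds to the "+ a few words" picture above.

```python
# Verify p.q = w_p.w_q + eps_p.eps_q for an orthonormal basis X.
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 20                               # vocabulary size, reduced dim
X, _ = np.linalg.qr(rng.normal(size=(n, k)))  # orthonormal columns

p = rng.normal(size=n)                        # a TF-IDF-like vector (toy)
q = rng.normal(size=n)

w_p, w_q = X.T @ p, X.T @ q                   # low-dimensional representations
eps_p, eps_q = p - X @ w_p, q - X @ w_q       # residuals, orthogonal to span(X)

print(np.allclose(p @ q, w_p @ w_q + eps_p @ eps_q))   # -> True

# Residual preservation: keep only the few largest residual entries.
top_words = np.argsort(np.abs(eps_p))[-10:]   # "+ a few words (10 words)"
```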

A Probabilistic Implementation

x is a switch variable; it controls whether a word is generated from:
• a topic-specific distribution,
• a document-specific distribution, or
• a background distribution:

$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) \;+\; p(x=1 \mid d)\, p_{\text{doc}}(w \mid d) \;+\; p(x=2 \mid d)\, p_{\text{bg}}(w)$$

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
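A sketch of this three-way mixture (numpy; all component distributions below are random toy assumptions rather than trained models):

```python
# Compute p(w|d) as a switch-variable mixture of three distributions.
import numpy as np

rng = np.random.default_rng(2)
K, V = 3, 10
p_x = np.array([0.7, 0.2, 0.1])                  # p(x=0|d), p(x=1|d), p(x=2|d)
p_w_given_z = rng.dirichlet(np.ones(V), size=K)  # (K, V) topic-word matrix
p_z_given_d = rng.dirichlet(np.ones(K))          # (K,)  topic mix of document d
p_w_doc = rng.dirichlet(np.ones(V))              # document-specific distribution
p_w_bg = rng.dirichlet(np.ones(V))               # background distribution

p_w_given_d = (p_x[0] * (p_z_given_d @ p_w_given_z)   # topic part
               + p_x[1] * p_w_doc                     # document-specific part
               + p_x[2] * p_w_bg)                     # background part
print(p_w_given_d.sum())   # -> 1.0, a proper distribution over the vocabulary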

Search (Online)
[Figure: online search pipeline; the query's low-dimensional vector plus a few residual words is hashed into an LSH index of document signatures (DS1, DS2, …); the matching bucket yields candidates (e.g., Doc 300, Doc 401, …), which are then re-ranked against the document metadata to produce the final list]
• Index: 10M images, 4.6GB
• Search speed: < 100 ms
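A sketch of this online stage using random-hyperplane LSH over the low-dimensional vectors; the bucketing and re-ranking details here are illustrative assumptions, not the paper's exact index layout.

```python
# Hash low-dimensional vectors with random-hyperplane LSH, then re-rank.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
dim, n_bits, n_docs = 200, 16, 10000
hyperplanes = rng.normal(size=(n_bits, dim))

def signature(w):
    bits = (hyperplanes @ w > 0).astype(int)
    return int(bits @ (1 << np.arange(n_bits)))   # pack bits into an int key

# Offline: bucket every document signature (DS) in the LSH index.
doc_w = rng.normal(size=(n_docs, dim))
index = defaultdict(list)
for i, w in enumerate(doc_w):
    index[signature(w)].append(i)

# Online: look up the query's bucket, then re-rank the candidates
# (here by w-similarity only; residual-word overlap omitted in this toy).
q = rng.normal(size=dim)
candidates = index[signature(q)]
ranked = sorted(candidates, key=lambda i: -(doc_w[i] @ q))
```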

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure
• Semantic: topics
• Structure: who replies to whom
• Optimize them together:
– Model the semantics
– Model the structure

Reply Reconstruction
• Combines three signals: document similarity, topic similarity, and structure similarity (a toy scoring sketch follows below).
• Baselines:
– NP: reply to the nearest post
– RR: reply to the root
– DS: document similarity
– LDA: Latent Dirichlet Allocation; projects documents to topic space
– SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk-topic space
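A hedged sketch of how a reply-reconstruction score could combine the three similarities; the weights, the per-candidate structure prior, and the random demo data are illustrative assumptions, not the SMSS model itself.

```python
# Score earlier posts as candidate parents by a weighted similarity mix.
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def best_parent(post, earlier_posts, w=(0.4, 0.4, 0.2)):
    """Return the index of the most plausible parent among earlier posts."""
    scores = [w[0] * cosine(post["tfidf"], c["tfidf"])     # document similarity
              + w[1] * cosine(post["topic"], c["topic"])   # topic similarity
              + w[2] * c["structure_prior"]                # structure similarity
              for c in earlier_posts]
    return int(np.argmax(scores))

rng = np.random.default_rng(4)
make = lambda prior: {"tfidf": rng.random(50), "topic": rng.random(10),
                      "structure_prior": prior}
thread = [make(0.9), make(0.3), make(0.1)]   # root gets a high structure prior
print(best_parent(make(0.0), thread))
```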

Evaluation

Method   Slashdot               Apple
         All Posts  Good Posts  All Posts  Good Posts
NP       0.021      0.012       0.289      0.239
RR       0.183      0.319       0.269      0.474
DS       0.463      0.643       0.409      0.628
LDA      0.465      0.644       0.410      0.648
SWB      0.463      0.644       0.410      0.641
SMSS     0.524      0.737       0.517      0.772

Expert Finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …

Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance in the expert finding task using a language model.
• PageRank: benchmark nodal ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential nodes.
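A minimal sketch of the expert-finding step on a reconstructed reply network, assuming networkx; the edge-direction convention (asker → replier, so authority accrues to people who answer) is an assumption made for illustration.

```python
# Rank users in a reply network with HITS and PageRank.
import networkx as nx

# Edge u -> v: v replied to u's post, so authority flows to repliers.
G = nx.DiGraph()
G.add_edges_from([("alice", "bob"), ("carol", "bob"),
                  ("dave", "bob"), ("alice", "carol")])

pagerank = nx.pagerank(G)              # benchmark nodal ranking
hubs, authorities = nx.hits(G)         # hubs ask, authorities answer
experts = sorted(authorities, key=authorities.get, reverse=True)
print(experts[:3])                     # 'bob' ranks first here
```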

Evaluation
• Bayesian estimate

Method         MRR    MAP    P@10
LM             0.821  0.698  0.800
EABIF(ori)     0.674  0.362  0.243
EABIF(rec)     0.742  0.318  0.281
PageRank(ori)  0.675  0.377  0.263
PageRank(rec)  0.743  0.321  0.266
HITS(ori)      0.906  0.832  0.900
HITS(rec)      0.938  0.822  0.906

Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision.
– Matrix decomposition: a good practice for learning matrices.
– Graphical models: a good practice for learning probability.
• Graphical models are a good tool for analyzing problems.
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.
• Graphical models are more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 69: An Introduction To Matrix Decomposition and Graphical Model

Joint Probability vs Likelihood

bull Joint probability

bull Likelihood (only for observed variables)

bull p(d) is assumed to be uniform

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 70: An Introduction To Matrix Decomposition and Graphical Model

Document Decomposition

bull Each document can be decomposed as

bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector

p(w|d) = ZVtimesk p(z|d)

bull With many documents we hope to find latent topics as common basis

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 71: An Introduction To Matrix Decomposition and Graphical Model

pLSA ndash Objective Function

bull pLSA tries to maximize the log likelihood

bull Due to the summation over z inside log we have to resort to EM

EM Steps

bull E-Stepndash Expectation of the likelihood function is calculated with the current

parameter valuesbull M-Step

ndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

Page 72: An Introduction To Matrix Decomposition and Graphical Model

EM Steps

• E-Step
  – The expectation of the likelihood function is calculated with the current parameter values.
• M-Step
  – Update the parameters using the posterior probabilities calculated in the E-Step.
  – Find the parameters that maximize the likelihood function (see the sketch below).
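As a concrete illustration of these two steps (not from the slides; all names are illustrative), a minimal numpy sketch of EM for a two-component 1-D Gaussian mixture:

    import numpy as np

    def em_gmm_1d(x, n_iter=50, seed=0):
        """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        pi = 0.5                                # mixing weight of component 1
        mu = rng.choice(x, size=2)              # initial means
        var = np.array([x.var(), x.var()])      # initial variances
        for _ in range(n_iter):
            # E-Step: posterior responsibilities under the current parameters
            p0 = (1 - pi) * np.exp(-(x - mu[0])**2 / (2*var[0])) / np.sqrt(2*np.pi*var[0])
            p1 = pi * np.exp(-(x - mu[1])**2 / (2*var[1])) / np.sqrt(2*np.pi*var[1])
            r = p1 / (p0 + p1)
            # M-Step: parameters that maximize the expected complete log-likelihood
            pi = r.mean()
            mu = np.array([((1 - r) * x).sum() / (1 - r).sum(),
                           (r * x).sum() / r.sum()])
            var = np.array([((1 - r) * (x - mu[0])**2).sum() / (1 - r).sum(),
                            (r * (x - mu[1])**2).sum() / r.sum()])
        return pi, mu, var

    # usage: x = np.r_[np.random.normal(-2, 1, 200), np.random.normal(3, 1, 200)]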

Lower Bounding the Log Likelihood

EM Steps

• The E-Step
• The M-Step
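For reference, a sketch of the standard pLSA EM updates (this assumes the usual aspect-model form, E-step p(z|d,w) ∝ p(w|z)p(z|d) and M-step re-estimation from expected counts; it is not copied from the slides):

    import numpy as np

    def plsa_em(N, K, n_iter=100, seed=0):
        """pLSA EM on a term-document count matrix N (V x D) with K aspects."""
        rng = np.random.default_rng(seed)
        V, D = N.shape
        p_w_z = rng.random((V, K)); p_w_z /= p_w_z.sum(0)   # p(w|z)
        p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(0)   # p(z|d)
        for _ in range(n_iter):
            # E-Step: posterior p(z|d,w) proportional to p(w|z) p(z|d)
            q = p_w_z[:, :, None] * p_z_d[None, :, :]       # V x K x D
            q /= q.sum(axis=1, keepdims=True) + 1e-12
            # M-Step: re-estimate from expected counts n(d,w) p(z|d,w)
            nq = N[:, None, :] * q
            p_w_z = nq.sum(axis=2); p_w_z /= p_w_z.sum(0)
            p_z_d = nq.sum(axis=0); p_z_d /= p_z_d.sum(0)
        return p_w_z, p_z_d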

Latent Subspace

pLSA vs LSA

• LSA and pLSA perform dimensionality reduction:
  – in LSA, by keeping only K singular values;
  – in pLSA, by having K aspects.
• Comparison to SVD:
  – U matrix: related to P(z|d) (doc to aspect)
  – V matrix: related to P(w|z) (aspect to term)
  – E matrix: related to P(z) (aspect strength)
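To make the SVD side concrete, a truncated-SVD (LSA) sketch on a toy matrix (illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((500, 100))                    # toy term-document matrix
    K = 20
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]   # rank-K LSA approximation
    docs_k = (np.diag(s[:K]) @ Vt[:K, :]).T       # documents in the K-dim latent space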

pLSA vs LSA

• The main difference is the way the approximation is done.
• pLSA generates a model (the aspect model) and maximizes its predictive power.
• Selecting the proper value of K is heuristic in LSA.
• In pLSA, statistical model selection can determine the optimal K.

Applications

• Text mining: topic discovery
• Scene classification

Text Mining

Scene Classification

Classification Result

Reference

• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.

• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. In Proceedings of the European Conference on Computer Vision, 2006.

• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA

• pLSA provides no probabilistic model at the document level; each document has its own topic mixture proportion.

• The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA

• There is no constraint on the distributions p(z|d_i).

• This easily leads to serious over-fitting problems.

[Figure: m separate pLSA graphical models, one per document d_1, d_2, …, d_m; each document carries its own topic distribution p(z|d_1), p(z|d_2), …, p(z|d_m) over its words.]

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.

• Requirements for such a distribution:
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials.
  – It is easy to optimize.

• The Dirichlet distribution is one such distribution.

• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution

• Definition:

  $p(x_1, \dots, x_K \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{K}\alpha_i\right)}{\prod_{i=1}^{K}\Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \qquad x_i \ge 0, \quad \sum_{i=1}^{K} x_i = 1$

• The density is zero outside this open (K − 1)-dimensional simplex.
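A small numpy check of these requirements (illustrative; α = (6, 2, 2) is one of the example parameters shown next):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([6.0, 2.0, 2.0])
    samples = rng.dirichlet(alpha, size=5)   # 5 points on the 2-simplex
    print(samples.min() >= 0)                # non-negative K-tuples
    print(samples.sum(axis=1))               # each sample sums to one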

Example Dirichlet Distributions (K=3)

• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6). [density plots lost]

Example Dirichlet Distributions (K=3)

• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$: α₀ = 0.1, α₀ = 1, α₀ = 10. [density plots lost]

The LDA Model

[Graphical model: documents drawn as chains z_1…z_4 → w_1…w_4, with corpus-level parameters α and β shared across all documents.]

• For each document:
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n:
    – Choose a topic z_n ~ Multinomial(θ)
    – Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
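A minimal sketch of this generative process in numpy (names and sizes are illustrative):

    import numpy as np

    def lda_generate(alpha, beta, n_words, rng):
        """Sample one document from the LDA generative process.
        alpha: (K,) Dirichlet parameter; beta: (K, V) topic-word probabilities."""
        theta = rng.dirichlet(alpha)                   # theta ~ Dirichlet(alpha)
        words = []
        for _ in range(n_words):
            z = rng.choice(len(alpha), p=theta)        # z_n ~ Multinomial(theta)
            w = rng.choice(beta.shape[1], p=beta[z])   # w_n from p(w_n | z_n, beta)
            words.append(w)
        return words

    rng = np.random.default_rng(0)
    K, V = 3, 10
    beta = rng.dirichlet(np.ones(V), size=K)           # K topics over V words
    doc = lda_generate(np.ones(K), beta, n_words=8, rng=rng)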

Joint Probability

• Given parameters α and β:

  $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$

where $p(z_n \mid \theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$.

Likelihood

• Joint probability: as above.

• Marginal distribution of a document:

  $p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$

• Likelihood over all the documents:

  $\ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \alpha, \beta)$

Inference

• The likelihood can be computed by summing over each document.
• Jensen's inequality is used, as in EM.

Inference

• In the E-Step, we need to compute the posterior distribution of the hidden variables.

• Unfortunately, this distribution is intractable to compute in general.

• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions.

Variational Inference

• The difference between the lower bound and the likelihood is the KL divergence.

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs EM

• They differ only in the E-Step.

• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, p(X|D, θ) is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) ‖ p(X|D, θ)).

• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.

• Strategy (Variational EM):
  – Lower bound log p(w | α, β) by a function L(γ, φ; α, β).
  – Repeat until convergence:
    – E: Maximize L with respect to the variational parameters γ and φ.
    – M: Maximize the bound with respect to the model parameters α and β.

Parameter Estimation

• E-Step: variational inference – repeat until convergence.

• M-Step: parameter estimation:
  – β: updated in closed form from the expected topic–word counts.
  – α: can be estimated using the Newton–Raphson method.

(a library usage example follows)
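In practice this variational EM is usually run via a library; as a usage illustration (not part of the slides), scikit-learn's LatentDirichletAllocation fits LDA by variational inference on a term-count matrix:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["stock market trading", "soccer match goal",
            "market prices fall", "goal keeper saves match"]
    X = CountVectorizer().fit_transform(docs)            # term-count matrix
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    theta = lda.fit_transform(X)                         # per-document topic mixtures
    topic_word = lda.components_                         # topic-word parameters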

Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus. [example topic-word lists lost]

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset – contains 8,000 documents and 15,818 words.

(a) EARN vs. NOT EARN; (b) GRAIN vs. NOT GRAIN [accuracy curves lost]

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

[Graphical model: the same LDA plate structure as above, repeated for reference.]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information

[Plate diagram: for each of M images, a category-conditioned mixture π generates a topic z and an observed patch feature x for each of the N_d patches; θ and β are the model parameters.]

Codebook

• 174 local image patches [codeword montage lost]

• Detection: evenly sampled grid; random sampling; saliency detector; Lowe's DoG detector

• Representation: normalized 11×11 gray values; 128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic Topic Models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes Pachinko Allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models

• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA

• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – We need to access 1000 inverted lists.
  – The intersection of 1000 inverted lists may be empty.
  – The union of 1000 inverted lists may be the whole corpus.

• Dimension reduction (topic projection; a code sketch follows):

  Term1  Term2  Term3  Term4  …  TermN        (Dim = 1 million)
  Img1:  1      2      0      0  …  2

      → topic projection →

  f1    f2   …  fM                            (Dim = 200)
  Img1: 0.2  0.1 …  0.03
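A sketch of that projection with TF-IDF plus truncated SVD (the library and toy corpus are illustrative; the paper's own projection is the orthogonal decomposition described next):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["red car on the road", "blue sky over the sea",
            "car engine repair shop", "deep blue sea water"]
    P = TfidfVectorizer().fit_transform(docs)   # docs x vocabulary (1M dims in the slides)
    svd = TruncatedSVD(n_components=2)          # 200 dims in the slides; 2 for this toy corpus
    W = svd.fit_transform(P)                    # low-dimensional features f1..fM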

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ξ: residual error

  $p = Xw + \xi$

An image ≈ a low-dimensional topic vector + a few words (~10 words). [bar-chart illustration lost]

Orthogonal Decomposition

For an orthonormal basis X = [x_1, x_2, …, x_k] (so that $X^\top X = I$), the decomposition is

  $p = Xw + \xi, \qquad w = X^\top p \ \text{(low-dimensional representation)}, \qquad \xi = p - XX^\top p \ \text{(residual)}$

where the columns of X are the base vectors. An image ≈ a low-dimensional topic vector + a few words (~10 words). [bar-chart illustration lost]

Because the residual is orthogonal to the span of X, inner products are preserved exactly:

  $p^\top q = w_p^\top w_q + \xi_p^\top \xi_q$
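A quick numeric check of this identity under the stated assumption that X has orthonormal columns (toy data):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 1000, 50
    X, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal basis: X^T X = I
    p, q = rng.standard_normal(n), rng.standard_normal(n)

    wp, wq = X.T @ p, X.T @ q                    # low-dimensional representations
    rp, rq = p - X @ wp, q - X @ wq              # residuals, orthogonal to span(X)
    assert np.isclose(p @ q, wp @ wq + rp @ rq)  # p^T q = w_p^T w_q + xi_p^T xi_q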

A Probabilistic Implementation

x is a switch variable; it controls whether a word is generated from:

• a topic-specific distribution,
• a document-specific distribution, or
• a background distribution:

  $p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) \;+\; p(x{=}1 \mid d)\, p_{\text{doc}}(w \mid d) \;+\; p(x{=}2 \mid d)\, p_{\text{bg}}(w)$

where the subscripts distinguish the document-specific and background word distributions.

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.

Search (Online)

[Diagram: offline, each document is stored as a low-dimensional topic vector plus a few residual words (DS1, DS2, …) in an LSH index. Online, the query is reduced the same way; the LSH index returns a candidate list (e.g., Doc 300, Doc 401, …), which is then re-ranked against the stored document metadata (Doc 1 … Doc N) using the residual words.]

Index: 10M images, 46GB. Search speed: < 100ms.
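The slides do not spell out the LSH scheme; a common choice for dense low-dimensional vectors is random-hyperplane (sign) hashing, sketched here with illustrative names:

    import numpy as np

    class SignLSH:
        """Random-hyperplane LSH over low-dimensional topic vectors (sketch)."""
        def __init__(self, dim, n_bits=16, seed=0):
            self.planes = np.random.default_rng(seed).standard_normal((n_bits, dim))
            self.buckets = {}

        def _key(self, w):
            return (self.planes @ w > 0).tobytes()   # sign signature of w

        def add(self, doc_id, w):
            self.buckets.setdefault(self._key(w), []).append(doc_id)

        def candidates(self, w):
            # documents whose topic vectors hash to the same bucket as the query
            return self.buckets.get(self._key(w), [])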

Search Examples

[Two example queries: query image and retrieved results; screenshots lost]

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics
• Structure: who replies to whom

Optimize them together

• Model the semantics
• Model the structure

[combined objective-function slide; formula lost]

Reply reconstruction

• Document similarity
• Topic similarity
• Structure similarity

(combined into a single reply score; see the sketch below)
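A minimal sketch of such a combined score (the weights, the cosine choice, and the field names are assumptions for illustration, not from the paper):

    import numpy as np

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    def reply_score(post, cand, w=(0.4, 0.4, 0.2)):
        """Score a candidate parent for a post by the three cues above.
        Assumes cand appears earlier in the thread than post."""
        s_doc = cos(post["tfidf"], cand["tfidf"])          # document similarity
        s_topic = cos(post["theta"], cand["theta"])        # topic similarity
        s_struct = 1.0 / (1 + post["pos"] - cand["pos"])   # structural proximity
        return w[0] * s_doc + w[1] * s_topic + w[2] * s_struct

    # parent = max(previous_posts, key=lambda c: reply_score(new_post, c))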

Baselines

• NP: reply to the nearest post
• RR: reply to the root
• DS: document similarity
• LDA: Latent Dirichlet Allocation; projects documents to a topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to a topic space plus a junk-topic space

Evaluation

Reply-reconstruction accuracy:

Method   Slashdot (All)   Slashdot (Good)   Apple (All)   Apple (Good)
NP       0.021            0.012             0.289         0.239
RR       0.183            0.319             0.269         0.474
DS       0.463            0.643             0.409         0.628
LDA      0.465            0.644             0.410         0.648
SWB      0.463            0.644             0.410         0.641
SMSS     0.524            0.737             0.517         0.772

Expert finding

Pipeline: reply reconstruction → network construction → expert finding

Methods:
• HITS
• PageRank
• … (a PageRank sketch follows)
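A minimal power-iteration PageRank over the reconstructed reply network (sketch; the damping factor 0.85 is the conventional choice, not stated on the slides):

    import numpy as np

    def pagerank(A, d=0.85, tol=1e-9):
        """Power iteration on a reply graph; A[i, j] = 1 if user j replied to user i.
        Dangling columns are left as-is for brevity."""
        n = A.shape[0]
        cols = A.sum(axis=0).astype(float)
        cols[cols == 0] = 1.0
        M = A / cols                            # (approximately) column-stochastic
        r = np.ones(n) / n
        while True:
            r_new = (1 - d) / n + d * (M @ r)
            if np.abs(r_new - r).sum() < tol:
                return r_new
            r = r_new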

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model.
• PageRank: benchmark node-ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential nodes.

Evaluation

• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF (ori)     0.674   0.362   0.243
EABIF (rec)     0.742   0.318   0.281
PageRank (ori)  0.675   0.377   0.263
PageRank (rec)  0.743   0.321   0.266
HITS (ori)      0.906   0.832   0.900
HITS (rec)      0.938   0.822   0.906

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision.
  – Matrix decomposition: a good practice for learning matrices.
  – Graphical models: a good practice for learning probability.

• Graphical models are a good tool for analyzing problems.

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 73: An Introduction To Matrix Decomposition and Graphical Model

Lower Bounding the Log Likelihood

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 74: An Introduction To Matrix Decomposition and Graphical Model

EM Steps

bull The E-Step

bull The M-Step

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 75: An Introduction To Matrix Decomposition and Graphical Model

Latent Subspace

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF (ori)      0.674   0.362   0.243
EABIF (rec)      0.742   0.318   0.281
PageRank (ori)   0.675   0.377   0.263
PageRank (rec)   0.743   0.321   0.266
HITS (ori)       0.906   0.832   0.900
HITS (rec)       0.938   0.822   0.906
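For reference, a small sketch of the three reported metrics under their standard definitions (binary relevance lists ordered by rank; not the paper's evaluation code):

```python
def reciprocal_rank(rel):
    return next((1.0 / (i + 1) for i, r in enumerate(rel) if r), 0.0)

def average_precision(rel):
    hits, total = 0, 0.0
    for i, r in enumerate(rel):
        if r:
            hits += 1
            total += hits / (i + 1)
    return total / max(hits, 1)

def precision_at(rel, k=10):
    return sum(rel[:k]) / k

def mean_over(metric, runs, **kw):
    return sum(metric(r, **kw) for r in runs) / len(runs)

# Toy usage: relevance judgments for two queries, best candidate first.
runs = [[1, 0, 1, 0, 0], [0, 1, 0, 0, 0]]
print(mean_over(reciprocal_rank, runs),      # MRR
      mean_over(average_precision, runs),    # MAP
      mean_over(precision_at, runs, k=5))    # P@5 (the paper reports P@10)
```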

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: good practice for learning matrix methods
– Graphical models: good practice for learning probability

• A graphical model is a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 76: An Introduction To Matrix Decomposition and Graphical Model

pLSA vs LSAbull LSA and PLSA perform dimensionality reduction

ndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects

bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 77: An Introduction To Matrix Decomposition and Graphical Model

pLSA vs LSAbull The main difference is the way the approximation is done

bull PLSA generates a model (aspect model) and maximizes its predictive power

bull Selecting the proper value of K is heuristic in LSA

bull Model selection in statistics can determine optimal K in PLSA

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 78: An Introduction To Matrix Decomposition and Graphical Model

Applications

bull Text mining topic discovering

bull Scene Classification

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 79: An Introduction To Matrix Decomposition and Graphical Model

Text Mining

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

Candidate reply links are scored by combining document similarity, topic similarity, and structure similarity; an illustrative scorer follows the baseline list below.

Baselines:
• NP – Reply to Nearest Post
• RR – Reply to Root
• DS – Document Similarity
• LDA – Latent Dirichlet Allocation: project documents to topic space
• SWB – Special Words Topic Model with Background distribution: project documents to topic and junk-topic space
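The actual SIGIR 2009 model optimizes these similarities jointly in a sparse-coding objective; purely as an illustration of how the three signals could be combined, here is a hypothetical scorer (field names, weights, and the scalar structure prior are made up):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def reply_score(post, cand, a=0.4, b=0.4, c=0.2):
    """Illustrative reply-link score from the three similarities."""
    return (a * cosine(post["tfidf"], cand["tfidf"])     # document similarity
          + b * cosine(post["topics"], cand["topics"])   # topic similarity
          + c * cand["structure_prior"])                  # structure similarity

def reconstruct_parent(post, earlier_posts):
    """Pick the earlier post this post most likely replies to."""
    return max(earlier_posts, key=lambda cand: reply_score(post, cand))
```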

Evaluation

Method   Slashdot               Apple
         All Posts  Good Posts  All Posts  Good Posts
NP       0.021      0.012       0.289      0.239
RR       0.183      0.319       0.269      0.474
DS       0.463      0.643       0.409      0.628
LDA      0.465      0.644       0.410      0.648
SWB      0.463      0.644       0.410      0.641
SMSS     0.524      0.737       0.517      0.772

Expert finding

Pipeline: reply reconstruction → network construction → expert finding.

Methods: HITS, PageRank, …

Baselines

• LM – Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06: achieves stable performance on the expert-finding task using a language model
• PageRank – benchmark nodal ranking method
• HITS – finds hub nodes and authority nodes (see the sketch after this list)
• EABIF – Personalized Recommendation Driven by Information Flow, SIGIR '06: finds the most influential nodes
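For concreteness, a compact power-iteration sketch of HITS on a reconstructed reply network; the edge direction (replier → repliee) is an assumption of this example:

```python
import numpy as np

def hits(A, iters=50):
    """HITS by power iteration on adjacency A, where A[i, j] = 1
    means user i replied to user j (assumed edge direction)."""
    hub = np.ones(A.shape[0])
    auth = np.ones(A.shape[0])
    for _ in range(iters):
        auth = A.T @ hub
        auth /= np.linalg.norm(auth)
        hub = A @ auth
        hub /= np.linalg.norm(hub)
    return hub, auth

# Toy reply network: users 0 and 1 both reply to user 2
A = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [0., 0., 0.]])
hub, auth = hits(A)
print(auth.argsort()[::-1])   # candidate experts = highest authority scores
```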

Evaluation

• Bayesian estimate

Method         MRR    MAP    P@10
LM             0.821  0.698  0.800
EABIF(ori)     0.674  0.362  0.243
EABIF(rec)     0.742  0.318  0.281
PageRank(ori)  0.675  0.377  0.263
PageRank(rec)  0.743  0.321  0.266
HITS(ori)      0.906  0.832  0.900
HITS(rec)      0.938  0.822  0.906
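For reference, minimal implementations of the three reported metrics, assuming each query yields a ranked list of candidates and a set of relevant experts:

```python
def reciprocal_rank(ranking, relevant):
    """1/position of the first relevant item (0 if none); averaged over queries, this is MRR."""
    return next((1.0 / (i + 1) for i, d in enumerate(ranking) if d in relevant), 0.0)

def average_precision(ranking, relevant):
    """Precision averaged at each relevant hit; averaged over queries, this is MAP."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranking):
        if d in relevant:
            hits += 1
            total += hits / (i + 1)
    return total / max(len(relevant), 1)

def precision_at_10(ranking, relevant):
    """Fraction of the top 10 results that are relevant (P@10)."""
    return sum(d in relevant for d in ranking[:10]) / 10.0
```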

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition – a good practice for learning matrices
  – Graphical models – a good practice for learning probability

• Graphical models are a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 80: An Introduction To Matrix Decomposition and Graphical Model

Scene Classification

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 81: An Introduction To Matrix Decomposition and Graphical Model

Classification Result

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 82: An Introduction To Matrix Decomposition and Graphical Model

Reference

bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999

bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)

bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 83: An Introduction To Matrix Decomposition and Graphical Model

LDA ndash LATENT DIRICHILET ALLOCATION

Problems in pLSA

bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion

bull The number of parameters in the model grows linearly with M (the number of documents in the training set)

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding
• Ranking methods: HITS, PageRank, …

Baselines:
• LM: Formal Models for Expert Finding in Enterprise Corpora (SIGIR '06); achieves stable performance on the expert-finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow (SIGIR '06); finds the most influential nodes

Evaluation

• Bayesian estimates:

Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF (ori)      0.674   0.362   0.243
EABIF (rec)      0.742   0.318   0.281
PageRank (ori)   0.675   0.377   0.263
PageRank (rec)   0.743   0.321   0.266
HITS (ori)       0.906   0.832   0.900
HITS (rec)       0.938   0.822   0.906

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision:
  – Matrix decomposition: a good practice for learning matrix methods
  – Graphical models: a good practice for learning probability

• A graphical model is a good tool for analyzing problems.

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images.

• Graphical models are more adaptable to various applications than matrix decomposition.

Page 84: An Introduction To Matrix Decomposition and Graphical Model

Problems in pLSA

• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportion.

• The number of parameters in the model grows linearly with M (the number of documents in the training set).

• There is no constraint on the distributions p(z|d_i), which easily leads to serious over-fitting.

(Figure: the pLSA graphical model. Each document d₁ … d_m carries its own topic distribution p(z|d₁), …, p(z|d_m); each topic variable z generates a word w.)

Dirichlet Distribution

• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.

• Requirements for such a distribution:
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomial parameters.
  – It should be easy to optimize.

• The Dirichlet distribution is one such distribution.

• The space of all such multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution (2)

• Definition:

  p(x₁, x₂, …, x_K | α) = Γ(Σ_{i=1}^{K} α_i) / Π_{i=1}^{K} Γ(α_i) · Π_{i=1}^{K} x_i^{α_i − 1},

  where x_i ≥ 0 and Σ_{i=1}^{K} x_i = 1.

• The density is zero outside this open (K − 1)-dimensional simplex.
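A quick numerical check of these properties, using NumPy's built-in Dirichlet sampler (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([6.0, 2.0, 2.0])              # one of the slide's example alphas
samples = rng.dirichlet(alpha, size=1000)      # each row lies on the 2-simplex

assert np.allclose(samples.sum(axis=1), 1.0)   # components sum to one
assert (samples >= 0).all()                    # and are non-negative
print(samples.mean(axis=0))                    # ~ alpha / alpha_0 = [0.6, 0.2, 0.2]
```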

Example Dirichlet Distributions (K=3)

• Various parameter values α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6). (Density plots omitted.)

• Equal α_i, different α₀ = Σ_{i=1}^{K} α_i: α₀ = 0.1, α₀ = 1, α₀ = 10. (Density plots omitted.)

The LDA Model

(Figure: LDA plate diagram — α → θ → z_n → w_n ← β, unrolled over the words of several documents.)

• For each document:
  – Choose θ ~ Dirichlet(α)
  – For each of the N words w_n:
    » Choose a topic z_n ~ Multinomial(θ)
    » Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
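The generative process translates almost line by line into code. A toy sketch, with assumed inputs α and β:

```python
# Toy LDA generative process (illustrative sizes; alpha, beta are given inputs).
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """alpha: (K,) Dirichlet prior; beta: (K, V) topic-word probabilities."""
    theta = rng.dirichlet(alpha)                  # theta ~ Dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # z_n ~ Multinomial(theta)
        w = rng.choice(beta.shape[1], p=beta[z])  # w_n ~ p(w_n | z_n, beta)
        words.append(w)
    return words

rng = np.random.default_rng(0)
K, V = 3, 20
beta = rng.dirichlet(np.ones(V), size=K)          # K topics, each a word dist.
doc = generate_document(np.ones(K), beta, n_words=10, rng=rng)
```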

Joint Probability

• Given parameters α and β (following the referenced Blei et al., 2003):

  p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^{N} p(z_n|θ) p(w_n|z_n, β),

  where p(z_n|θ) is simply θ_i for the unique i such that z_nⁱ = 1.

Likelihood

• Joint probability: p(θ, z, w | α, β), as above.

• Marginal distribution of a document:

  p(w | α, β) = ∫ p(θ|α) ( Π_{n=1}^{N} Σ_{z_n} p(z_n|θ) p(w_n|z_n, β) ) dθ

• Likelihood over all the documents: the product of the per-document marginals, Π_{d=1}^{M} p(w_d | α, β).

Inference

• The likelihood can be computed by summing over each document.
• Jensen's inequality gives a lower bound, as in EM.

Inference (2)

• In the E-step, we need to compute the posterior distribution of the hidden variables.

• Unfortunately, this distribution is intractable to compute in general.

• We have to resort to a variational approach.

Variational Inference

• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions.

Variational Inference (2)

• The difference between the lower bound L(γ, φ; α, β) and the log likelihood is the KL divergence.

• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is therefore equivalent to minimizing the KL divergence.

VBEM vs. EM

• They differ only in the E-step.

• In standard EM, q(X) is directly set to p(X|D, θ), making KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, p(X|D, θ) is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ)).

• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.

• Strategy (variational EM):
  – Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
  – Repeat until convergence:
    – E: maximize L with respect to the variational parameters γ and φ
    – M: maximize the bound with respect to the parameters α and β

Parameter Estimation (2)

• E-step: variational inference; repeat until convergence (updates from the referenced Blei et al., 2003):

  φ_{ni} ∝ β_{i,w_n} exp(Ψ(γ_i))
  γ_i = α_i + Σ_{n=1}^{N} φ_{ni}

• M-step: parameter estimation:

  β_{ij} ∝ Σ_{d=1}^{M} Σ_{n=1}^{N_d} φ_{dni} w_{dn}ʲ

  α can be updated using the Newton–Raphson method.
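For concreteness, a sketch of the per-document E-step above, following the updates in the referenced Blei et al. (2003) paper (variable names are assumptions):

```python
# Per-document variational E-step for LDA (sketch after Blei et al., 2003).
import numpy as np
from scipy.special import digamma

def e_step(doc_words, alpha, beta, n_iter=100, tol=1e-4):
    """doc_words: list of word ids; alpha: (K,); beta: (K, V) topic-word probs."""
    K, N = len(alpha), len(doc_words)
    gamma = alpha + N / K                       # init: gamma_i = alpha_i + N/K
    for _ in range(n_iter):
        # phi_{ni} ∝ beta_{i,w_n} * exp(digamma(gamma_i))
        phi = beta[:, doc_words].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        new_gamma = alpha + phi.sum(axis=0)     # gamma = alpha + sum_n phi_n
        if np.abs(new_gamma - gamma).max() < tol:
            return new_gamma, phi
        gamma = new_gamma
    return gamma, phi
```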

Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus. (Topic-word tables omitted.)

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset: contains 8,000 documents and 15,818 words.

(a) EARN vs. NOT EARN; (b) GRAIN vs. NOT GRAIN. (Accuracy plots omitted.)

Problems in LDA

• The Dirichlet distribution is helpful for avoiding over-fitting, but the assumption might be too strong.

(Figure: the LDA plate diagram repeated across documents, as above.)

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information.

(Figure: plate diagram with variables π → z → x and parameters θ and β, over N_d patches and M images.)

Codebook

• 174 local image patches.

• Detection: evenly sampled grid; random sampling; saliency detector; Lowe's DoG detector.

• Representation: normalized 11×11 gray values; 128-dim SIFT.

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic Topic Models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes Pachinko Allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA: Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University of Cambridge, 1998.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 85: An Introduction To Matrix Decomposition and Graphical Model

Problems in pLSA

bull There is no constraint for distributions p(z|di)

bull Easy to lead to serious problems with over-fitting

d1

z1

w1

z2

w2

zN1

wN1

hellip

d2

z1

w1

z2

w2

zN2

wN2

hellip

dm

z1

w1

z2

w2

zNm

wNm

hellip

p(z|d1) p(z|d2) p(z|dm)

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 86: An Introduction To Matrix Decomposition and Graphical Model

Dirichlet Distribution

bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution

bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of non-

negative numbers that sum to one That is the samples are multinormialsndash Easy to optimize

bull Dirichlet Distribution is one of such distributions

bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 87: An Introduction To Matrix Decomposition and Graphical Model

Dirichlet Distribution

bull Definition

bull The density is zero outside this open (K minus 1)-dimensional simplex

k

i ii

ki ik

i i

ki i

kk

xx

xxxxp i

1

11

1

12121

1 0 st

)Γ()(Γ)|(

bull Various parameter α

(6 2 2) (3 7 5)

(2 3 4) (6 2 6)

Example Dirichlet Distributions (K=3)

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Page 88: An Introduction To Matrix Decomposition and Graphical Model

Example Dirichlet Distributions (K=3)

• Various parameter vectors α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)

[Figure: Dirichlet density over the 3-simplex for each α]

Example Dirichlet Distributions (K=3)

• Equal α_i, different α_0 = Σ_{i=1}^{k} α_i: α_0 = 0.1, α_0 = 1, α_0 = 10

[Figure: symmetric Dirichlet densities; a small α_0 piles mass at the simplex corners, a large α_0 concentrates it near the uniform center]
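To make the effect of α concrete, here is a small NumPy sketch (illustrative only, not from the slides) that draws samples from the distributions shown above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Asymmetric parameters from the slide: the mean of Dirichlet(alpha) is
# alpha / alpha_0, so e.g. (6, 2, 2) pulls samples toward the first corner.
for alpha in [(6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]:
    samples = rng.dirichlet(alpha, size=1000)
    print(alpha, "sample mean:", samples.mean(axis=0).round(3))

# Symmetric parameters with different alpha_0 = sum(alpha_i): a small
# alpha_0 gives sparse, corner-hugging samples; a large alpha_0
# concentrates samples near the uniform point (1/3, 1/3, 1/3).
for alpha0 in (0.1, 1.0, 10.0):
    samples = rng.dirichlet([alpha0 / 3] * 3, size=1000)
    print("alpha_0 =", alpha0, "sample std:", samples.std(axis=0).round(3))
```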

The LDA Model

[Plate-notation graphical model: α → θ → z_n → w_n ← β, unrolled as z_1…z_4, w_1…w_4 for three example documents]

• For each document:
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n:
    – Choose a topic z_n ~ Multinomial(θ)
    – Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
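A toy transcription of this generative process (a sketch under made-up sizes for the vocabulary, topic count, and document length):

```python
import numpy as np

rng = np.random.default_rng(1)

V, K, n_docs, n_words = 20, 3, 5, 50           # toy sizes (assumptions)
alpha = np.full(K, 0.5)                        # Dirichlet prior over topics
beta = rng.dirichlet(np.full(V, 0.1), size=K)  # K topic-word multinomials

corpus = []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)               # theta ~ Dirichlet(alpha)
    doc = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)             # z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])           # w_n ~ p(w | z_n, beta)
        doc.append(w)
    corpus.append(doc)
```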

Joint Probability

• Given parameters α and β:

  p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)

  where p(θ | α) is the Dirichlet density and p(z_n | θ) = θ_{z_n}

Likelihood

• Joint probability: p(θ, z, w | α, β), as above
• Marginal distribution of a document: p(w | α, β) = ∫ p(θ | α) ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) dθ
• Likelihood over all the documents: p(D | α, β) = ∏_{d=1}^{M} p(w_d | α, β)

Inference

• The log likelihood can be computed by summing over the documents
• As in EM, Jensen's inequality gives a tractable lower bound

Inference (2)

• In the E-step we need to compute the posterior distribution of the hidden variables, p(θ, z | w, α, β)
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach

Variational Inference

• In variational inference, we consider a simplified graphical model with free variational parameters (γ, φ) and minimize the KL divergence between the variational distribution and the true posterior

• The difference between the lower bound and the log likelihood is exactly this KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is therefore equivalent to minimizing the KL divergence
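Spelled out, following the standard derivation in Blei et al. (2003), the identity behind these two bullets is:

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big(q(\theta, \mathbf{z} \mid \gamma, \phi)\,\|\,p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big),
\qquad
L(\gamma, \phi; \alpha, \beta)
  = \mathbb{E}_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)]
  - \mathbb{E}_q[\log q(\theta, \mathbf{z})].
```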

VBEM vs. EM

• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X | D, θ), so the KL term is 0
• In VBEM, p(X | D, θ) is intractable to compute; instead it is approximated by a variational distribution q(X) obtained by minimizing KL(q(X) || p(X | D, θ))
• This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data

• Strategy (variational EM):
  • Lower-bound log p(w | α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence:
    – E: maximize L with respect to the variational parameters γ and φ
    – M: maximize the bound with respect to the model parameters α and β

Parameter Estimation (2)

• E-step (variational inference) – repeat until convergence:

  φ_{ni} ∝ β_{i,w_n} exp(Ψ(γ_i)),    γ_i = α_i + Σ_{n=1}^{N} φ_{ni}

• M-step (parameter estimation):

  β_{ij} ∝ Σ_{d=1}^{M} Σ_{n=1}^{N_d} φ*_{dni} w_{dn}^{j}

  α has no closed-form update and can be fitted with the Newton–Raphson method
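A minimal sketch of the per-document E-step above (assuming β is given as a K×V array; toy code, not the paper's implementation):

```python
import numpy as np
from scipy.special import digamma

def e_step(doc, alpha, beta, n_iter=50):
    """Variational E-step for one document.

    doc   : array of word ids, shape (N,)
    alpha : Dirichlet prior, shape (K,)
    beta  : topic-word probabilities, shape (K, V)
    """
    K, N = beta.shape[0], len(doc)
    phi = np.full((N, K), 1.0 / K)        # q(z_n) initialization
    gamma = alpha + N / K                 # q(theta) initialization
    for _ in range(n_iter):
        # phi_{ni} is proportional to beta_{i,w_n} * exp(digamma(gamma_i))
        phi = beta[:, doc].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_{ni}
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi
```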

Topic Examples in a 100-topic LDA Model

• 16,000 documents from a subset of the TREC AP corpus

[Table: top words of several learned topics]

Classification (50-topic LDA + SVM)

• Reuters-21578 dataset – contains 8,000 documents and 15,818 words

[Figure: accuracy curves for (a) EARN vs. NOT EARN and (b) GRAIN vs. NOT GRAIN]
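For illustration, a comparable pipeline can be put together with scikit-learn; this is a hedged sketch with placeholder data, not the original experiment (which used the full Reuters-21578 corpus):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus: a real run would load Reuters-21578 and label
# documents as EARN vs. NOT EARN (or GRAIN vs. NOT GRAIN).
docs = ["wheat corn harvest exports", "quarterly earnings rose sharply",
        "grain shipments delayed", "company reports record profit"]
labels = [1, 0, 1, 0]

clf = make_pipeline(
    CountVectorizer(),                           # bag-of-words counts
    LatentDirichletAllocation(n_components=50,   # 50-topic LDA features
                              random_state=0),
    LinearSVC(),                                 # linear SVM on topic mixtures
)
clf.fit(docs, labels)
print(clf.predict(["corn harvest outlook"]))
```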

Problems in LDA

• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong

[The same plate-notation graphical model as above: α → θ → z_n → w_n ← β]

A Bayesian Hierarchical Model for Learning Natural Scene Categories

• Incorporating category information

[Plate diagram over M images and N_d patches, with variables π, z, x and parameters θ, β]

Codebook

• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector
• Representation: normalized 11×11 gray values, 128-dim SIFT

Topic Distribution in Different Categories

[Figure: per-category distributions over the learned topics]

Topic Hierarchical Clustering

[Figure: hierarchical clustering of the learned topics]

More Topic Models

• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2006
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London, 2003.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – Need to access 1000 inverted lists
  – The intersection of the 1000 inverted lists may be empty
  – The union of the 1000 inverted lists may be the whole corpus

• Dimension reduction via topic projection:
  – Term space (dim = 1 million): Img1 = (1, 2, 0, 0, …, 2) over Term1, Term2, Term3, Term4, …, TermN
  – Topic space (dim = 200): Img1 = (0.2, 0.1, …, 0.03) over f1, f2, …, fM

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

  p = Xw + ε

[Bar chart: an image = a 200-dim topic vector + a few residual words (~10 words)]

Orthogonal Decomposition

  p = Xw + ε = [x_1, x_2, …, x_k] w + ε

• Base vectors: the orthonormal columns x_1, …, x_k of X
• Low-dimensional representation: w = X^T p
• Residual: ε = p − X X^T p, orthogonal to the span of X

[The same bar chart: an image = a 200-dim topic vector + a few residual words (~10 words)]

• Similarity is preserved exactly: p^T q = (X w_p + ε_p)^T (X w_q + ε_q) = w_p^T w_q + ε_p^T ε_q, since the cross terms vanish by orthogonality
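A small NumPy sketch of this decomposition (illustrative: random vectors stand in for TF-IDF data, and an orthonormal X comes from a QR factorization):

```python
import numpy as np

rng = np.random.default_rng(2)
V, k = 1000, 20                                # toy vocabulary size, topic dim

X, _ = np.linalg.qr(rng.normal(size=(V, k)))   # orthonormal base vectors
p, q = rng.normal(size=V), rng.normal(size=V)  # two TF-IDF-like vectors

wp, wq = X.T @ p, X.T @ q                      # low-dimensional representations
ep, eq = p - X @ wp, q - X @ wq                # residuals, orthogonal to span(X)

# Similarity decomposes exactly into a low-dim part plus a residual part:
assert np.isclose(p @ q, wp @ wq + ep @ eq)

# In the paper's setting only the ~10 largest residual coordinates
# ("a few words") would be kept; a sketch of that truncation:
top = np.argsort(-np.abs(ep))[:10]
ep_sparse = np.zeros_like(ep)
ep_sparse[top] = ep[top]
```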

A Probabilistic Implementation

• x is a switch variable; it controls whether a word is generated from:
  • a topic-specific distribution
  • a document-specific distribution
  • a background distribution

  p(w | d) = p(x=0 | d) Σ_{k=1}^{K} p(w | z=k) p(z=k | d) + p(x=1 | d) p′(w | d) + p(x=2 | d) p″(w)

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
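Transcribing the mixture directly into code (a sketch; every distribution below is a toy random value, and p′/p″ are represented as plain arrays):

```python
import numpy as np

rng = np.random.default_rng(3)
V, K = 50, 5                                   # toy vocabulary / topic sizes

p_w_given_z = rng.dirichlet(np.ones(V), K)     # topic-specific:   p(w | z=k)
p_z_given_d = rng.dirichlet(np.ones(K))        # topic mixture:    p(z=k | d)
p_special = rng.dirichlet(np.ones(V))          # document-specific p'(w | d)
p_background = rng.dirichlet(np.ones(V))       # corpus-wide       p''(w)
p_x = np.array([0.6, 0.3, 0.1])                # switch prior      p(x | d)

# p(w|d) = p(x=0|d) sum_k p(w|z=k) p(z=k|d) + p(x=1|d) p'(w|d) + p(x=2|d) p''(w)
p_w_given_d = (p_x[0] * p_z_given_d @ p_w_given_z
               + p_x[1] * p_special
               + p_x[2] * p_background)
assert np.isclose(p_w_given_d.sum(), 1.0)      # still a proper distribution
```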

Search (Online)

[Pipeline diagram: a query = a low-dimensional topic vector + a few residual words; the topic vector is hashed into an LSH index over document signatures (DS1, DS2, …), which returns a candidate list (Doc 300, Doc 401, …); candidates are then re-ranked against the stored doc metadata]

• Index: 10M images, 46 GB
• Search speed: < 100 ms
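For exposition, here is a random-hyperplane LSH sketch of the kind of index such a pipeline could use; the paper's exact indexing scheme is not reproduced, and all sizes are assumptions:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(4)
dim, n_bits, n_docs = 200, 8, 10000            # assumed sizes

planes = rng.normal(size=(n_bits, dim))        # random hyperplanes

def lsh_key(vec):
    """Signature: the sign pattern of the vector against each hyperplane."""
    return ((planes @ vec) > 0).tobytes()

# Offline: bucket every document's low-dimensional vector.
doc_vecs = rng.normal(size=(n_docs, dim))
index = defaultdict(list)
for doc_id, vec in enumerate(doc_vecs):
    index[lsh_key(vec)].append(doc_id)

# Online: fetch the query's bucket, then re-rank candidates by cosine similarity.
query = rng.normal(size=dim)
candidates = index[lsh_key(query)]
ranked = sorted(candidates,
                key=lambda i: -(doc_vecs[i] @ query) / np.linalg.norm(doc_vecs[i]))
```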

Search Example

[Query image and top retrieved results]

Search Example (2)

[Query image and top retrieved results]

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, and Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: the topics being discussed
• Structure: who replies to whom

Optimize Them Together

• Model the semantics and the structure jointly, optimizing both in one objective

Reply Reconstruction

• Document similarity
• Topic similarity
• Structure similarity

Baselines:
• NP: reply to the nearest post
• RR: reply to the root
• DS: document similarity
• LDA: Latent Dirichlet Allocation; project documents to topic space
• SWB: Special Words Topic Model with Background distribution; project documents to topic and junk-topic space
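As a concrete reference point, a sketch of the DS baseline (an assumption for illustration: TF-IDF plus cosine similarity, linking each post to its most similar earlier post):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def reconstruct_replies(posts):
    """Guess each post's parent as its most similar earlier post (DS baseline)."""
    tfidf = TfidfVectorizer().fit_transform(posts)
    sim = cosine_similarity(tfidf)
    parents = [None]                                 # the root has no parent
    for i in range(1, len(posts)):
        parents.append(int(np.argmax(sim[i, :i])))   # best-matching earlier post
    return parents

posts = ["how do I install the driver?",
         "which OS are you on?",
         "I am on the same OS and have the driver issue too"]
print(reconstruct_replies(posts))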

Evaluation

method   Slashdot               Apple
         All Posts  Good Posts  All Posts  Good Posts
NP       0.021      0.012       0.289      0.239
RR       0.183      0.319       0.269      0.474
DS       0.463      0.643       0.409      0.628
LDA      0.465      0.644       0.410      0.648
SWB      0.463      0.644       0.410      0.641
SMSS     0.524      0.737       0.517      0.772

Expert Finding

• Pipeline: reply reconstruction → network construction → expert finding
• Ranking methods: HITS, PageRank, …

Baselines (2)

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance on the expert-finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential nodes

Evaluation (2)

• Bayesian estimate

Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
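For reference, the three metrics in this table can be computed from ranked result lists as follows (standard definitions; not the paper's evaluation code):

```python
def mrr(ranked_lists, relevant):
    """Mean reciprocal rank of the first relevant item per query."""
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        for rank, item in enumerate(ranking, start=1):
            if item in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def average_precision(ranking, rel):
    """Precision averaged over the ranks of the relevant items."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in rel:
            hits += 1
            score += hits / rank
    return score / max(len(rel), 1)

def mean_ap(ranked_lists, relevant):
    return sum(average_precision(r, rel)
               for r, rel in zip(ranked_lists, relevant)) / len(ranked_lists)

def precision_at_10(ranking, rel):
    """Fraction of the top 10 results that are relevant."""
    return sum(1 for item in ranking[:10] if item in rel) / 10.0
```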

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition – a good practice for learning matrices
  – Graphical models – a good practice for learning probability
• Graphical models are a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• Graphical models are more adaptable to various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 89: An Introduction To Matrix Decomposition and Graphical Model

Example Dirichlet Distributions (K=3)

bull Equal αi different

α0=01 α0=1 α0=10

k

i i10

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 90: An Introduction To Matrix Decomposition and Graphical Model

The LDA Model

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 91: An Introduction To Matrix Decomposition and Graphical Model

The LDA Model

bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn

ndash Choose a topic znraquo Multinomial()

ndash Choose a word wn from p(wn|znb) a multinomial probability conditioned on the topic zn

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 92: An Introduction To Matrix Decomposition and Graphical Model

Joint Probability

bull Given parameter α and β

where

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772


Expert finding

Pipeline: reply reconstruction → network construction → expert finding.

Methods: HITS, PageRank, …
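For instance, HITS can be run directly on the reconstructed reply network; a compact power-iteration sketch over a toy adjacency matrix (illustrative only):

    import numpy as np

    def hits(A, iters=50):
        """Standard HITS on adjacency A (A[i, j] = 1 if user i replied
        to user j); returns (hub, authority) score vectors."""
        h = np.ones(A.shape[0])
        a = np.ones(A.shape[0])
        for _ in range(iters):
            a = A.T @ h
            a /= np.linalg.norm(a)   # authority: pointed to by good hubs
            h = A @ a
            h /= np.linalg.norm(h)   # hub: points to good authorities
        return h, a

    # Toy reply graph: users 0 and 1 both reply to user 2
    A = np.array([[0, 0, 1],
                  [0, 0, 1],
                  [0, 0, 0]], dtype=float)
    hubs, auths = hits(A)
    print(auths.argmax())  # user 2 has the highest authority score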


Baselines:
• LM: "Formal Models for Expert Finding in Enterprise Corpora," SIGIR '06; achieves stable performance on the expert-finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: "Personalized Recommendation Driven by Information Flow," SIGIR '06; finds the most influential nodes

Evaluation


• Bayesian estimate

Method         | MRR   | MAP   | P@10
LM             | 0.821 | 0.698 | 0.800
EABIF(ori)     | 0.674 | 0.362 | 0.243
EABIF(rec)     | 0.742 | 0.318 | 0.281
PageRank(ori)  | 0.675 | 0.377 | 0.263
PageRank(rec)  | 0.743 | 0.321 | 0.266
HITS(ori)      | 0.906 | 0.832 | 0.900
HITS(rec)      | 0.938 | 0.822 | 0.906
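For reference, minimal implementations of the three metrics in the table, for binary relevance flags in rank order (illustrative, not the authors' evaluation scripts):

    def mrr(queries):
        """Mean reciprocal rank; `queries` is a list of 0/1 relevance
        flag lists, one per query, in rank order."""
        rr = [next((1.0 / r for r, f in enumerate(flags, 1) if f), 0.0)
              for flags in queries]
        return sum(rr) / len(rr)

    def precision_at_10(flags):
        """Fraction of relevant items among the top 10."""
        return sum(flags[:10]) / 10.0

    def average_precision(flags):
        """Average of precision@r over the ranks r of relevant items."""
        hits, ap = 0, 0.0
        for r, f in enumerate(flags, 1):
            if f:
                hits += 1
                ap += hits / r
        return ap / hits if hits else 0.0

    print(mrr([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1) / 2 = 0.75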

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: a good way to learn matrices
  – Graphical models: a good way to learn probability

• Graphical models are a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

Page 93: An Introduction To Matrix Decomposition and Graphical Model

Likelihood

bull Joint Probability

bull Marginal distribution of a document

bull Likelihood over all the documents

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 94: An Introduction To Matrix Decomposition and Graphical Model

Inference

bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 95: An Introduction To Matrix Decomposition and Graphical Model

Inference

bull In E-Step we need to compute the posterior distribution of the hidden variables

bull Unfortunately this distribution is intractable to compute in general

bull We have to resort to variational approach

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 96: An Introduction To Matrix Decomposition and Graphical Model

Variational Inference

bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 97: An Introduction To Matrix Decomposition and Graphical Model

Variantional Inference

bull The difference between the lower bound and the likelihood is the KL divergence

bull Maximizing the lower bound L(b) with respect to and is equivalent to minimizing the KL divergence

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

Method |        Slashdot        |         Apple
       | All Posts | Good Posts | All Posts | Good Posts
NP     |   0.021   |   0.012    |   0.289   |   0.239
RR     |   0.183   |   0.319    |   0.269   |   0.474
DS     |   0.463   |   0.643    |   0.409   |   0.628
LDA    |   0.465   |   0.644    |   0.410   |   0.648
SWB    |   0.463   |   0.644    |   0.410   |   0.641
SMSS   |   0.524   |   0.737    |   0.517   |   0.772


Expert Finding

Pipeline: reply reconstruction → network construction → expert finding

Methods: HITS, PageRank, …
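For example, HITS on the reconstructed reply network takes only a few lines of power iteration (a generic sketch; the convention A[i, j] = 1 when user i replies to user j is an assumption):

import numpy as np

def hits(A, iters=50):
    """Power iteration for HITS on adjacency matrix A.
    Returns (hub, authority) score vectors."""
    h = np.ones(A.shape[0])
    a = np.ones(A.shape[0])
    for _ in range(iters):
        a = A.T @ h
        a /= np.linalg.norm(a)     # authorities: replied to by good hubs
        h = A @ a
        h /= np.linalg.norm(h)     # hubs: reply to good authorities
    return h, a

# Candidate experts are the users with the highest authority scores.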

Baselines:
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance on the expert-finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential nodes

Evaluation


• Bayesian estimate

Method         | MRR   | MAP   | P@10
LM             | 0.821 | 0.698 | 0.800
EABIF (ori)    | 0.674 | 0.362 | 0.243
EABIF (rec)    | 0.742 | 0.318 | 0.281
PageRank (ori) | 0.675 | 0.377 | 0.263
PageRank (rec) | 0.743 | 0.321 | 0.266
HITS (ori)     | 0.906 | 0.832 | 0.900
HITS (rec)     | 0.938 | 0.822 | 0.906
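For reference, the three reported metrics can be computed as follows (a standard sketch, assuming per-query ranked lists of users and sets of relevant experts):

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant item, or 0 if none is retrieved."""
    for i, u in enumerate(ranked, start=1):
        if u in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, u in enumerate(ranked, start=1):
        if u in relevant:
            hits += 1
            total += hits / i
    return total / max(len(relevant), 1)

def precision_at_10(ranked, relevant):
    return len(set(ranked[:10]) & set(relevant)) / 10.0

# MRR, MAP, and P@10 are the means of these per-query values over all queries.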

Summary

• Matrices and probability are the fundamental mathematics of information retrieval and computer vision
  – Matrix decomposition: a good way to learn matrices
  – Graphical models: a good way to learn probability

• Graphical models are a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 98: An Introduction To Matrix Decomposition and Graphical Model

VBEM vs EM

bull Only different in the E-Step

bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it

approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)

bull This is also equivalent to maximizing the lower bound L(θ)

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 99: An Introduction To Matrix Decomposition and Graphical Model

Parameter Estimation

bull Given a corpus of documents we would like to find the parameters and b which maximize the likelihood of the observed data

bull Strategy (Variational EM)

bull Lower bound log p(w|b) by a function L(b)bull Repeat until convergence

ndash E Maximize L(b) with respect to the variational parameters ndash M Maximize the bound with respect to parameters and b

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 100: An Introduction To Matrix Decomposition and Graphical Model

Parameter Estimation

bull E-Step Variational Inference ndash repeat until convergence

bull M-Step Parameter estimation

β

α can be implemented using the Newton-Raphson method

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 101: An Introduction To Matrix Decomposition and Graphical Model

Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 102: An Introduction To Matrix Decomposition and Graphical Model

Classification (50-topic LDA + SVM)

bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words

(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 103: An Introduction To Matrix Decomposition and Graphical Model

Problems in LDA

bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong

z4z3z2z1

w4w3w2w1

b

z4z3z2z1

w4w3w2w1

z4z3z2z1

w4w3w2w1

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

• Dynamic Topic Models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2006
• Nonparametric Bayes Pachinko Allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are you really into Graphical Models?

• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005.

Reference

• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.

• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA

• Two Applications
  – Document decomposition for “long query” retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for “Long Query” Retrieval (Similarity Search)

Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

• If a query contains 1000 keywords:
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus

• Dimension reduction (topic projection):

  Term vector (Dim = 1 million):   Img1 = (Term1: 1, Term2: 2, Term3: 0, Term4: 0, …, TermN: 2)

  ↓ Topic Projection

  Feature vector (Dim = 200):      Img1 = (f1: 0.2, f2: 0.1, …, fM: 0.03)

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ξ: residual error

p = Xw + ξ

An image = [figure: bar chart of the low-dimensional feature vector w] + a few words (~10 words)
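To make the key idea concrete, here is a minimal sketch (assuming numpy and an already learned projection matrix X with orthonormal columns; the 10-word cutoff follows the slide, the rest is illustrative):

import numpy as np

def decompose(p, X, n_residual_words=10):
    # w = X^T p is the low-dimensional feature vector; the residual p - Xw
    # is kept only through its few largest entries ("a few words").
    w = X.T @ p
    residual = p - X @ w
    top = np.argsort(-np.abs(residual))[:n_residual_words]
    return w, {int(i): float(residual[i]) for i in top}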

Orthogonal Decomposition

p = Xw + ξ, with base vectors X = (x1, x2, …, xk) chosen orthonormal, so that w = X^T p and the residual ξ = p - Xw is orthogonal to span(X)

• X: base vectors
• w: low-dimensional representation
• ξ: residual

An image = [figure: bar chart of the low-dimensional feature vector w] + a few words (~10 words)

Because ξ is orthogonal to the base vectors X1, X2, X3, …, Xk, inner products decompose exactly:

p^T q = w_p^T w_q + ξ_p^T ξ_q
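A quick numerical check of this identity (a self-contained sketch; the orthonormal basis here is random, obtained by QR):

import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 200
X, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal base vectors

p, q = rng.standard_normal(n), rng.standard_normal(n)
w_p, w_q = X.T @ p, X.T @ q                        # low-dimensional parts
xi_p, xi_q = p - X @ w_p, q - X @ w_q              # residuals, orthogonal to span(X)

assert np.isclose(p @ q, w_p @ w_q + xi_p @ xi_q)  # p^T q = w_p^T w_q + xi_p^T xi_q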

A Probabilistic Implementation

x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution
• a document-specific distribution
• a background distribution

p(w|d) = p(x=0|d) Σ_{k=1..K} p(w|z=k) p(z=k|d) + p(x=1|d) p'(w|d) + p(x=2|d) p(w|B)

where the first term is the topic-specific part, p'(w|d) is the document-specific distribution, and p(w|B) is the background distribution.

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
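Written as code, the switch-variable mixture could look as follows (a sketch: the parameter arrays stand in for a trained model, and their names are assumptions rather than the paper's notation):

import numpy as np

def p_word_given_doc(w, d, p_switch, p_w_z, p_z_d, p_w_doc, p_w_bg):
    # p_switch[d] = (p(x=0|d), p(x=1|d), p(x=2|d))
    topic_term = p_switch[d, 0] * np.dot(p_w_z[w, :], p_z_d[:, d])  # topic-specific
    doc_term = p_switch[d, 1] * p_w_doc[w, d]                       # document-specific
    bg_term = p_switch[d, 2] * p_w_bg[w]                            # background
    return topic_term + doc_term + bg_term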

Search (Online)

[Figure: online search pipeline. The query (a low-dimensional feature vector + a few words) is hashed into an LSH index over document signatures (DS1, DS2, …); candidate documents (e.g., Doc 300, Doc 401, …) are fetched from the doc-meta store and re-ranked into the final result list]

Index: 10M images, 4.6 GB
Search speed: < 100 ms
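One standard way to realize the LSH index in the diagram is random-hyperplane hashing; a minimal sketch (the signature length and single-table bucket layout are assumptions, not necessarily the system's exact scheme):

import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))  # one hyperplane per signature bit
        self.buckets = defaultdict(list)

    def _signature(self, v):
        return tuple(bool(b) for b in (self.planes @ v) > 0)

    def add(self, doc_id, v):
        self.buckets[self._signature(v)].append(doc_id)

    def query(self, v):
        return self.buckets[self._signature(v)]  # candidate documents for re-ranking

Candidates from the matching bucket would then be re-ranked with the exact similarity p^T q = w_p^T w_q + ξ_p^T ξ_q.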

Search Example

[Figure: query image with retrieved results]

Search Example (2)

[Figure: another query image with retrieved results]

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantic: topics
• Structure: who replies to whom

Optimize Them Together

• Model the semantics
• Model the structure

Reply Reconstruction

• Document similarity
• Topic similarity
• Structure similarity
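To illustrate how the three similarities could drive reply reconstruction, here is a hypothetical scoring sketch (the weights, the cosine choice, and struct_prior are illustrative assumptions, not the actual sparse-coding model):

import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def pick_parent(post_vecs, topic_vecs, struct_prior, i, weights=(0.4, 0.4, 0.2)):
    # Score every earlier post j as a candidate parent of post i.
    a, b, c = weights
    scores = [a * cosine(post_vecs[i], post_vecs[j])      # document similarity
              + b * cosine(topic_vecs[i], topic_vecs[j])  # topic similarity
              + c * struct_prior[j]                       # structure similarity
              for j in range(i)]
    return int(np.argmax(scores)) if scores else None     # the root has no parent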

Baselines

• NP: reply to nearest post
• RR: reply to root
• DS: document similarity
• LDA: Latent Dirichlet Allocation; project documents to the topic space
• SWB: Special Words Topic Model with Background distribution; project documents to the topic and junk-topic spaces

Evaluation

Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772

Expert Finding

Pipeline: reply reconstruction → network construction → expert finding

Methods: HITS, PageRank, …
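For reference, HITS on the reconstructed reply network is a few lines of power iteration (a standard sketch; A[i, j] = 1 when user i replies to user j):

import numpy as np

def hits(A, n_iter=50):
    # Hub and authority scores of a reply graph via power iteration.
    n = A.shape[0]
    hubs, auths = np.ones(n), np.ones(n)
    for _ in range(n_iter):
        auths = A.T @ hubs                 # authorities are replied to by good hubs
        auths /= np.linalg.norm(auths)
        hubs = A @ auths                   # hubs reply to good authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, auths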

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance on the expert-finding task using a language model
• PageRank: benchmark node-ranking method
• HITS: finds hub and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential nodes

Evaluation

• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
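The reported metrics can be computed as follows (a standard sketch, assuming one ranked result list of binary relevance labels per query):

def reciprocal_rank(rels):
    # 1/rank of the first relevant result, 0 if none.
    return next((1.0 / (i + 1) for i, r in enumerate(rels) if r), 0.0)

def average_precision(rels):
    # Precision averaged over the positions of relevant results.
    hits, score = 0, 0.0
    for i, r in enumerate(rels):
        if r:
            hits += 1
            score += hits / (i + 1)
    return score / hits if hits else 0.0

def precision_at(rels, k=10):
    return sum(rels[:k]) / k

# MRR and MAP are the means of reciprocal_rank / average_precision over all queries.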

Summary

• Matrix and probability are the fundamental mathematics of information retrieval and computer vision
  – Matrix decomposition: a good way to practice matrix analysis
  – Graphical models: a good way to practice probability

• A graphical model is a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 104: An Introduction To Matrix Decomposition and Graphical Model

A Bayesian Hierarchical Model for Learning Natural Scene Categories

bull Incorporating category information

MNd

π

z

x

θ

β

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 105: An Introduction To Matrix Decomposition and Graphical Model

Codebookbull 174 Local Image Patches

bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector

bull RepresentationNormalized 11x11 gray values128-dim SIFT

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 106: An Introduction To Matrix Decomposition and Graphical Model

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 107: An Introduction To Matrix Decomposition and Graphical Model

Topic Hierarchical Clustering

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 108: An Introduction To Matrix Decomposition and Graphical Model

More Topic Models

bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American

Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 109: An Introduction To Matrix Decomposition and Graphical Model

Are you really into Graphical Models

bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

• Dimension reduction (Topic Projection): project each term vector (Dim = 1 million) into a topic space (Dim = 200)

  Term space:   Term1  Term2  Term3  Term4  …  TermN
  Img1:             1      2      0      0  …      2

  Topic space:     f1     f2    …     fM
  Img1:           0.2    0.1    …   0.03

Key Idea: Dimension Reduction + Residual Error Preservation

• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

p = Xw + ε

[Figure: an image is represented as a low-dimensional topic vector plus a few residual words (~10 words)]
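As a concrete illustration of the key idea, here is a minimal numpy sketch; the projection matrix X, the TF-IDF vector p, and all sizes below are made up for illustration. The vector is split into a 200-dimensional topic part w plus a sparse residual, of which only the ~10 largest entries ("a few words") are kept.

```python
import numpy as np

# Illustrative sizes only: a 10,000-term vocabulary projected to 200 topics.
n_terms, n_topics = 10_000, 200
rng = np.random.default_rng(0)

# Hypothetical projection matrix with orthonormal columns (in practice this
# would come from a decomposition / topic model learned on the corpus).
X, _ = np.linalg.qr(rng.standard_normal((n_terms, n_topics)))

p = rng.random(n_terms)              # original TF-IDF vector (made up)
w = X.T @ p                          # low-dimensional feature vector, dim = 200
eps = p - X @ w                      # residual error in vocabulary space

# Keep only the ~10 largest residual entries ("+ a few words").
top = np.argsort(-np.abs(eps))[:10]
sparse_eps = np.zeros_like(eps)
sparse_eps[top] = eps[top]

approx = X @ w + sparse_eps          # image ≈ topic part + a few words
print(np.linalg.norm(p - approx) / np.linalg.norm(p))
```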

Orthogonal Decomposition

p = Xw + ε, with X = [x1, x2, x3, …, xk]

• Base vectors: the columns x1, …, xk of X, assumed orthonormal (X^T X = I)
• Low-dimensional representation: w = X^T p
• Residual: ε = p − X X^T p, which is orthogonal to the span of X

Because the cross terms vanish, inner products are preserved exactly:

p^T q = (X w_p + ε_p)^T (X w_q + ε_q) = w_p^T w_q + ε_p^T ε_q

[Figure: an image = a low-dimensional representation + a few residual words (~10 words)]
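The identity can be checked numerically; a small sketch under the same assumptions (a random orthonormal X and made-up vectors, not the learned basis of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 50
# Orthonormal basis X (columns x1..xk), so X^T X = I.
X, _ = np.linalg.qr(rng.standard_normal((n, k)))

def decompose(v):
    w = X.T @ v              # low-dimensional representation
    eps = v - X @ w          # residual, orthogonal to span(X)
    return w, eps

p, q = rng.random(n), rng.random(n)
w_p, eps_p = decompose(p)
w_q, eps_q = decompose(q)

# Cross terms vanish because X^T eps = 0, so the two numbers agree:
print(p @ q, w_p @ w_q + eps_p @ eps_q)
```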

A Probabilistic Implementation

x is a switch variable. It controls whether a word is generated from:

• a topic-specific distribution (x = 0)

• a document-specific distribution (x = 1)

• a background distribution (x = 2)

p(w|d) = p(x=0|d) Σ_{k=1}^{K} p(w|z=k) p(z=k|d) + p(x=1|d) p_doc(w|d) + p(x=2|d) p_bg(w)

where p_doc is the document-specific word distribution and p_bg the corpus-wide background distribution.

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
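A toy generative sketch of this switch-variable mixture; the parameter values below are invented for illustration, not the model fitted in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
V, K = 6, 2                                   # toy vocabulary size and topic count

p_x = [0.5, 0.3, 0.2]                         # p(x=0|d), p(x=1|d), p(x=2|d)
p_z_d = [0.7, 0.3]                            # p(z=k|d): topic mixture of document d
p_w_z = rng.dirichlet(np.ones(V), size=K)     # p(w|z=k): topic-specific distributions
p_doc = rng.dirichlet(np.ones(V))             # p_doc(w|d): document-specific words
p_bg = np.full(V, 1.0 / V)                    # p_bg(w): background distribution

def sample_word():
    x = rng.choice(3, p=p_x)                  # the switch variable
    if x == 0:                                # topic-specific route
        z = rng.choice(K, p=p_z_d)
        return rng.choice(V, p=p_w_z[z])
    if x == 1:                                # document-specific route
        return rng.choice(V, p=p_doc)
    return rng.choice(V, p=p_bg)              # background route

print([sample_word() for _ in range(10)])
```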

Search (Online)

[Figure: online search pipeline. A query is decomposed into a low-dimensional vector plus a few residual words; the vector is hashed into an LSH index to fetch candidate documents (e.g., Doc 300, Doc 401, …), which are then re-ranked using the stored document metadata. Index: 10M images, 46GB; search speed < 100 ms.]
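A toy sketch of this pipeline, assuming random-hyperplane LSH over the low-dimensional vectors followed by exact re-ranking of the shortlisted candidates; all data, sizes, and the single-table lookup are illustrative simplifications:

```python
import numpy as np

rng = np.random.default_rng(3)
k, n_docs, n_bits = 200, 10_000, 8

docs_w = rng.standard_normal((n_docs, k))     # low-dimensional doc vectors (made up)
planes = rng.standard_normal((n_bits, k))     # random hyperplanes for the LSH code

def lsh_key(w):
    # Sign pattern of the projections, packed into one integer bucket key.
    bits = (planes @ w > 0).astype(int)
    return int("".join(map(str, bits)), 2)

# Offline: hash every document into its bucket.
buckets = {}
for i, w in enumerate(docs_w):
    buckets.setdefault(lsh_key(w), []).append(i)

# Online: hash the query, fetch its bucket, re-rank the candidates exactly.
# (A real system would probe several hash tables / neighboring buckets.)
query_w = rng.standard_normal(k)
candidates = buckets.get(lsh_key(query_w), [])
ranked = sorted(candidates, key=lambda i: -(docs_w[i] @ query_w))
print(ranked[:10])
```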

Search Example

[Figure: query image and retrieved results]

Search Example (2)

[Figure: query image and retrieved results]

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang

SIGIR 2009

Semantic & Structure

• Semantics: the topics

• Structure: who replies to whom

Optimize them together

• Model the semantics

• Model the structure

Reply reconstruction

• Document similarity

• Topic similarity

• Structure similarity
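Read as an algorithm, one plausible way to combine the three similarities when picking a reply target; the weights alpha/beta/gamma and the similarity functions below are hypothetical stand-ins, not the paper's sparse-coding objective:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def reply_target(doc_vecs, topic_vecs, struct_prior, new_idx,
                 alpha=0.4, beta=0.4, gamma=0.2):
    """Pick which earlier post the post at new_idx most likely replies to.

    doc_vecs:     term vectors of the posts (document similarity)
    topic_vecs:   topic-space vectors of the posts (topic similarity)
    struct_prior: prior weight for replying to each earlier position
                  (structure, e.g., favoring the root or recent posts)
    """
    scores = [alpha * cosine(doc_vecs[new_idx], doc_vecs[j])
              + beta * cosine(topic_vecs[new_idx], topic_vecs[j])
              + gamma * struct_prior[j]
              for j in range(new_idx)]
    return int(np.argmax(scores))
```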

Baselines

• NP: reply to the nearest post

• RR: reply to the root

• DS: document similarity

• LDA: Latent Dirichlet Allocation; project documents to the topic space

• SWB: Special Words Topic Model with Background distribution; project documents to the topic and junk-topic space

Evaluation

Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772

Expert finding

• Pipeline: reply reconstruction → network construction → expert finding

• Methods: HITS, PageRank, …
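For the network and ranking steps, a compact HITS power iteration over a reconstructed reply graph; the adjacency matrix below is a made-up toy network, not data from the paper:

```python
import numpy as np

def hits(A, iters=50):
    """HITS power iteration on adjacency matrix A (A[i, j] = 1: i replies to j)."""
    n = A.shape[0]
    hubs, auths = np.ones(n), np.ones(n)
    for _ in range(iters):
        auths = A.T @ hubs                  # authorities are pointed to by good hubs
        auths /= np.linalg.norm(auths)
        hubs = A @ auths                    # hubs point to good authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Toy reply network over four users (edges reconstructed from the threads).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
hubs, auths = hits(A)
print("users ranked by authority:", np.argsort(-auths))
```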

Baselines

• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06; achieves stable performance on the expert-finding task using a language model

• PageRank: benchmark nodal ranking method

• HITS: finds hub nodes and authority nodes

• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06; finds the most influential nodes

Evaluation


• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
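For reference, the three metrics in this table can be computed as below; a minimal sketch where `ranked` is one query's ranked result list and `relevant` its set of correct answers (both hypothetical), with MRR and MAP being these values averaged over queries:

```python
def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant item (averaged over queries -> MRR)."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Precision averaged over the positions of relevant items (-> MAP)."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / max(len(relevant), 1)

def precision_at(ranked, relevant, k=10):
    """P@10: fraction of the top-k results that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k
```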

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: a good way to practice matrix skills
  – Graphical models: a good way to practice probability

• Graphical models are a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 110: An Introduction To Matrix Decomposition and Graphical Model

Reference

bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003

bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998

bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 111: An Introduction To Matrix Decomposition and Graphical Model

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA

bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 112: An Introduction To Matrix Decomposition and Graphical Model

Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)

Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum

ICCV 2009

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 113: An Introduction To Matrix Decomposition and Graphical Model

The Long Query Problem

bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus

bull Dimension reductionTerm1 Term2 Term3 Term4 hellip TermN

Img1 1 2 0 0 hellip 2

f1 f2 hellip fM

Img1 02 01 hellip 003

Topic Projection

Dim = 1 million

Dim = 200

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 114: An Introduction To Matrix Decomposition and Graphical Model

Key Idea Dimension Reduction + Residual Error Preservation

bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error

p Xw

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 115: An Introduction To Matrix Decomposition and Graphical Model

Orthogonal Decomposition

p Xw 1 11 1 1

2 21 2 1 2

1 1

1

k

k

W k W

W W Wk W

p x xp x x w

p wp x x

Base vector

Low dimensional representation

Residual

An image = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words(10 words)

T Tp q p q

p q

w w

X1 X2 X3 hellip Xk

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 116: An Introduction To Matrix Decomposition and Graphical Model

A Probabilistic Implementation

x is a switch variable It controls a word generated from

bull a topic specific distribution

bull a document specific distribution

bull a background distribution

( | )p w d 1( 0 | ) ( | ) ( | )K

kp x d p w z k p z k d

( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w

C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

• Bayesian estimate

Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF (ori)     0.674  0.362  0.243
EABIF (rec)     0.742  0.318  0.281
PageRank (ori)  0.675  0.377  0.263
PageRank (rec)  0.743  0.321  0.266
HITS (ori)      0.906  0.832  0.900
HITS (rec)      0.938  0.822  0.906

(ori: original reply network; rec: reconstructed reply network)

Summary

• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: a good practice for learning matrices
  – Graphical models: a good practice for learning probability

• Graphical models are a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images

• Graphical models are more adaptable to various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 117: An Introduction To Matrix Decomposition and Graphical Model

Search (Online)

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

DS1 DS2 hellip

hellip

hellip

hellip

hellip

hellip

LSH Index

Doc 300 Doc 401 hellip

A query = 0 01 02 03 04 05 06 07 08 09 1

0

2

4

6

8

10

12

14

16

+ a few words

Re-rankingDoc 401 hellip

Doc 1

Doc 2

Doc 300

Doc 401

Doc N

Doc 300

Index 10M Images 46GBSearch Speed lt 100ms

Doc Meta

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 118: An Introduction To Matrix Decomposition and Graphical Model

Search Example

Query Image

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 119: An Introduction To Matrix Decomposition and Graphical Model

Search Example

Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 120: An Introduction To Matrix Decomposition and Graphical Model

Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse

Coding Approach and Its Applications

Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG

SIGIR 2009

123

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 121: An Introduction To Matrix Decomposition and Graphical Model

Semantic amp structure

124

SemanticTopics

StructureWho reply to who

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 122: An Introduction To Matrix Decomposition and Graphical Model

Optimize them together

Model semantic

Model structure

125

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 123: An Introduction To Matrix Decomposition and Graphical Model

Reply reconstruction

126

DocumentSimilarity

TopicSimilarity

StructureSimilarity

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 124: An Introduction To Matrix Decomposition and Graphical Model

BaselinesNP

Reply to Nearest PostRR

Reply to RootDS

Document SimilarityLDA

Latent Dirichlet Allocation Project documents to topic space

SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space

127

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 125: An Introduction To Matrix Decomposition and Graphical Model

Evaluation

method Slashdot Apple

All Posts Good Posts All Posts Good Posts

NP 0021 0012 0289 0239

RR 0183 0319 0269 0474

DS 0463 0643 0409 0628

LDA 0465 0644 0410 0648

SWB 0463 0644 0410 0641

SMSS 0524 0737 0517 0772

128

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability

bull Graphical model is a good tool to analyze problems

bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages

bull It is more adaptable for various applications than matrix decomposition

  • An Introduction To Matrix Decomposition and Graphical Model
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
  • Outline (2)
  • Not Included
  • Basic Concepts
  • What Is Machine Learning
  • Likelihood Function
  • Maximum Likelihood (ML)
  • IID ndash Independent Identically Distributed
  • Reference (2)
  • Expectation Maximization
  • Why We Need EM
  • More General
  • The Expectation Maximization (EM) Algorithm
  • Jensenrsquos Inequality
  • Lower Bounding the Log Likelihood
  • The E and M Steps of EM
  • The E Step
  • The M Step
  • EM Never Decreases the Likelihood
  • Reference (3)
  • Why Do We Need Graphical Model
  • Why Do We Need Graphical Models
  • Directed Acyclic Graphical Models (Bayesian Networks)
  • Directed Graphs for Statistical Models Plate Notation
  • pLSA ndash Probabilistic Latent Semantic Analysis
  • Latent Semantic Indexing (LSI) Review
  • pLSA ndash Probabilistic Latent Semantic Analysis (2)
  • pLSA
  • pLSA (2)
  • Joint Probability vs Likelihood
  • Document Decomposition
  • pLSA ndash Objective Function
  • EM Steps
  • Lower Bounding the Log Likelihood (2)
  • EM Steps (2)
  • Latent Subspace
  • pLSA vs LSA
  • pLSA vs LSA (2)
  • Applications
  • Text Mining
  • Scene Classification
  • Classification Result
  • Reference (4)
  • LDA ndash Latent Dirichilet Allocation
  • Problems in pLSA
  • Problems in pLSA (2)
  • Dirichlet Distribution
  • Dirichlet Distribution (2)
  • Example Dirichlet Distributions (K=3)
  • Example Dirichlet Distributions (K=3) (2)
  • The LDA Model
  • The LDA Model (2)
  • Joint Probability
  • Likelihood
  • Inference
  • Inference (2)
  • Variational Inference
  • Variantional Inference
  • VBEM vs EM
  • Parameter Estimation
  • Parameter Estimation (2)
  • Topic Examples in a 100-topic LDA Model)
  • Classification (50-topic LDA + SVM)
  • Problems in LDA
  • A Bayesian Hierarchical Model for Learning Natural Scene Catego
  • Codebook
  • Topic Distribution in Different Categories
  • Topic Hierarchical Clustering
  • More Topic Models
  • Are you really into Graphical Models
  • Reference (5)
  • Outline (3)
  • Slide 115
  • The Long Query Problem
  • Key Idea Dimension Reduction + Residual Error Preservation
  • Orthogonal Decomposition
  • A Probabilistic Implementation
  • Search (Online)
  • Search Example
  • Search Example (2)
  • Simultaneously Modeling Semantics and Structure of Threaded Dis
  • Semantic amp structure
  • Optimize them together
  • Reply reconstruction
  • Baselines
  • Evaluation
  • Expert finding
  • Baselines (2)
  • Evaluation (2)
  • Summary
Page 126: An Introduction To Matrix Decomposition and Graphical Model

Expert finding

Reply reconstruction

Network construction

Expert finding

Methods

HITS

PageRank

hellip

129

Baselines

LMFormal Models for Expert Finding in Enterprise Corpora SIGIR

06Achieves stable performance in expert finding task using a

language modelPageRank

Benchmark nodal ranking methodHITS

Find hub nodes and authority nodeEABIF

Personalized Recommendation Driven by Information Flow SIGIR rsquo06

Find most influential node130

Evaluation

131

bull Bayesian estimate

Method MRR MAP P10

LM 0821 0698 0800

EABIF(ori) 0674 0362 0243

EABIF(rec) 0742 0318 0281

PageRank(ori) 0675 0377 0263

PageRank(rec) 0743 0321 0266

HITS(ori) 0906 0832 0900

HITS(rec) 0938 0822 0906

Summary

• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition – a good practice for learning matrices
  – Graphical models – a good practice for learning probability

• A graphical model is a good tool for analyzing problems

• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images

• The graphical model approach is more adaptable to various applications than matrix decomposition
