
Harmonium Models for Semantic Video Representation and

Classification

Jun Yang Yan Liu Eric P. Xing Alexander G. HauptmannSchool of Computer ScienceCarnegie Mellon University

{juny, yanliu, epxing, alex}@cs.cmu.edu

Abstract

Accurate and efficient video classification demands the fusion of multimodal information and the use of intermediate representations. Combining the two ideas in the same framework, we propose a probabilistic approach to video classification using intermediate semantic representations derived from multi-modal features. Based on a class of bipartite undirected graphical models named harmoniums, our approach represents video data as latent semantic topics derived by jointly modeling transcript keywords and color-histogram features, and performs classification using these latent topics under a unified framework. We show satisfactory classification performance of our approach on a benchmark dataset, along with some interesting insights into the data provided by this approach.

1 Introduction

Classifying video data into semantic categories, sometimes known as semantic video concept detection, is an important research topic. Video data contain multifarious data types, including image frames, transcript text, speech, and audio signals, each bearing correlated and complementary information essential to the analysis and retrieval of video data. The fusion of such multimodal information is regarded as a key research problem [10], and has been a widely used technique in video classification and retrieval methods. Many fusion strategies have been proposed, varying from early fusion [12], which merges the feature vectors extracted from different modalities, to late fusion, which combines the outputs of the classifiers or "retrieval experts" built on each single modality [12, 6, 18, 15]. Empirical results show that methods based on the fusion of multimodal information outperform those based on any single type of information in both video classification and retrieval tasks.

Another trend in video classification is the search for low-dimensional, intermediate representations of video data. The primary reason is to make sophisticated classifiers (e.g., SVM) affordable, which would otherwise be computationally expensive on the high-dimensional raw features. Moreover, using intermediate representations holds the promise of better interpretation of the data semantics, and may lead to superior classification performance. Work along this direction includes conventional dimension-reduction methods such as principal component analysis (PCA) and Fisher linear discriminant (FLD) [4], as well as probabilistic methods such as probabilistic latent semantic indexing (pLSI) [5], latent Dirichlet allocation (LDA) [2], and the exponential-family harmonium (EFH) [14]. While many of these models are for single-modal data such as textual documents, there are also extensions for modeling multi-modal data such as captioned images and video [1, 17].

The key insights for video classification from previous work appear to be combining multimodal information and using intermediate representations. The goal of this paper is to take advantage of both insights in an integrated and principled approach. Based on a class of bipartite undirected graphical models (i.e., random fields) called harmoniums [14, 17], our approach extracts an intermediate representation as latent semantic topics of video data by jointly modeling the correlated information in image regions and transcript keywords. Moreover, this approach explicitly introduces category label(s) into the model, which allows classification and representation to be accomplished in a unified framework.

The proposed approach is significantly different from previous models for text/multimedia data, mainly in that it incorporates category labels as (hidden) model variables, in addition to the variables representing data (features) and latent semantic topics. This allows classification to be done by inferring the distribution of these label variables conditioned on the observed data variables. In contrast, existing models [2, 1, 5, 14, 17] are mainly for deriving the intermediate data representation in the form of latent semantic topics. Classification, if needed, is performed using a separate classifier built on top of the derived intermediate representations. Therefore, one advantage of our approach is a unified model for both representation and classification, which avoids the need to build separate classifiers. More importantly, by modeling the interactions between latent semantic topics and category labels, this approach tailors the intermediate representation to reflect category information of the data. Such "supervised" intermediate representations are expected to provide more discriminative power and insights into the data than the "unsupervised" representations generated by existing methods [2, 1, 5, 14, 17].

Our proposal includes two related models, each bearing different implications for the representation and classification of video data. Family-of-harmonium (FoH) builds a family of category-specific harmonium models, each modeling the video data from a specific category. The label of a video shot is predicted by comparing its likelihood under each harmonium model. Hierarchical harmonium (HH) incorporates all the category labels as an additional layer of hidden variables in a single harmonium model, and performs classification through the inference of these label variables. Due to its model structure, the FoH model reveals the internal structure of each category, and can be easily extended to include new categories without retraining the whole model. In contrast, the HH model reveals the relationships between multiple categories, and takes advantage of such relationships in classification.

In Section 2 we review related work on the fusion of multimodal video features as well as representation models for video data. We describe the two proposed models in Section 3 and discuss learning and inference in Section 4. In Section 5, we show that the proposed models achieve satisfactory classification performance and interesting data interpretation through experiments on the TRECVID video collection. Conclusions and future work are discussed in Section 6.

2 Related Work

As pointed out in [10], the processing, indexing, and fusion of data in multiple modalities is a core problem of multimedia research. For video classification and retrieval, the fusion of features from multiple data types (e.g., key-frames, audio, transcript) allows them to complement each other to achieve better performance than using any single type of feature, and has been widely used in many existing methods. The fusion strategies vary from early fusion [12], which merges the feature vectors extracted from different data modalities, to late fusion, which combines the output of classifiers or "retrieval experts" built on each single modality [12, 6, 18, 15]. It remains an open question which fusion strategy is more appropriate for a given task, and a comparison of the two strategies in video classification is presented in [12]. The approach presented here uses neither of these two fusion strategies; instead, it derives the latent semantic representation of the video data by jointly modeling the multimodal low-level features, so the fusion takes place somewhere between early fusion and late fusion.

There are many approaches to obtaining low-dimensional intermediate representations of video data. Principal component analysis (PCA) has been the most popular method; it projects the raw features into a lower-dimensional feature space where the data variances are well preserved. Independent component analysis (ICA) and Fisher linear discriminant (FLD) are other popular dimension-reduction methods. Recently, there have also been many proposals for modeling the latent semantic topics of text and multimedia data. For example, latent semantic indexing (LSI) by Deerwester et al. [3] transforms term counts linearly into a low-dimensional semantic eigenspace, and the idea was later extended by Hofmann to probabilistic LSI (pLSI) [5]. Latent Dirichlet allocation (LDA) by Blei et al. [2] is a directed graphical model that provides generative semantics of text documents, where each document is associated with a topic-mixing vector and each word is independently sampled according to a topic drawn from this topic-mixing. LDA has been extended to Gaussian-Mixture LDA (GM-LDA) and Correspondence LDA (Corr-LDA) [1], both of which are used to model annotated data such as captioned images or video with transcript text. The exponential-family harmonium (EFH) proposed by Welling et al. [14] is a bipartite undirected graphical model consisting of a layer of latent nodes representing semantic aspects and a layer of observed nodes representing raw data (features). To model multi-modal data, Xing et al. [17] extended it to the multi-wing harmonium model, where the data layer consists of two or more "wings" of nodes representing textual, imagery, and other types of data, respectively.

In practice, the methods mentioned above are mainly used for transforming high-dimensional raw data into a low-dimensional representation that presumably captures the latent semantics of the data. The classification task is usually performed by building a separate discriminative classifier (e.g., SVM) on such latent semantic representations. In this paper, we seek a more unified approach where representation and classification are integrated into the same framework. This approach not only achieves satisfactory classification performance, but also provides interesting insights into the data semantics, such as the internal structure of each category and the relationships between different categories. Fei-Fei et al. [8] used a unified model for representing and classifying natural scene images by introducing category variables into the LDA model, which is similar to our approach except that our models are undirected.

Figure 1: A sketch of our approach

3 Our Approach

A sketch of our approach is illustrated in Figure 1. The basic data unit to be classified is the video shot, namely a video segment whose length varies from a few seconds to half a minute or even longer. We represent each video shot by a bag of keywords extracted from the video transcript, which is obtained from the video's closed captions or speech recognition, as well as a set of fixed-size image regions extracted from the keyframe of the video shot. Each region is described by its color-histogram feature. We learn the model that best describes the joint distribution of the keywords and color features of the video shots in each category. In classification, we extract the keywords and color features from an unlabeled video shot, from which we derive the most likely category the shot belongs to. The two proposed models, family-of-harmonium and hierarchical harmonium, differ in the way the data are modeled and classified.

Both of our models are based on a class of bipartite undirected models (i.e., random fields) called harmoniums, which have been used by Welling et al. [14] and Xing et al. [17] to model text and multimedia data. Our models use theirs as the basic building block, but differ by explicitly incorporating the category labels into the model. This allows our models to represent and classify video data in a unified framework, while the previous harmonium models are only for data representation.

3.1 Notations and definitions The notations used in this paper follow the convention of probabilistic models. Uppercase characters represent random variables, while lowercase characters represent the instances (values) of the random variables. Bold font indicates a vector of random variables or their values. In the illustrations, shaded circles represent observed nodes while unfilled circles represent hidden (latent) nodes. Each node in a graphical model is associated with a random variable, and we use the terms node and variable interchangeably in this paper.

The semantics of the model variables are described as follows:

• A video shot s is represented by a tuple (x, z, h, y), whose components respectively denote the keywords, region-based color features, latent semantic topics, and category labels of the shot.

• The vector x = (x_1, ..., x_N) denotes the keyword feature extracted from the transcript associated with the shot. Here N is the size of the word vocabulary, and x_i ∈ {0, 1} is a binary variable that indicates the absence or presence of the ith keyword (of the vocabulary) in the shot.

• The vector z = (z_1, ..., z_M) denotes the color-histogram features of the keyframe of the shot. Each keyframe is evenly divided into a grid of M fixed-size rectangular regions, and z_j ∈ R^C is a C-dimensional vector that represents the color histogram of the jth region. Thus z is a stacked vector of length CM.

• The vector h = (h_1, ..., h_K) represents the latent semantic topics of the shot, where K is the total number of latent topics. Each component h_k ∈ R denotes how strongly the shot is associated with the kth latent topic.

• The category labels of a shot are modeled differently in the two models. In family-of-harmonium, a single variable y ∈ {1, ..., T} indicates the category the shot belongs to, where T is the total number of categories. In hierarchical harmonium, the labels are represented by a vector y = (y_1, ..., y_T), with each y_t ∈ {0, 1} denoting whether the shot is in the tth category. Here a video shot belongs to only one category, so we have ∑_t y_t = 1.

• The two proposed models have different sets of parameters. The family-of-harmonium has a specific harmonium model for each category y, with parameters θ^y = (π^y, α^y, β^y, W^y, U^y). The hierarchical harmonium has a single set of parameters θ = (α, β, τ, W, U, V).

Figure 2: The family-of-harmonium model
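As an aside, the region-based color feature z can be computed by dividing the keyframe into a fixed grid and histogramming the colors of each region. The following is a minimal numpy sketch; the 3×3 grid and 4 bins per channel are illustrative choices of ours, not values taken from the paper:

```python
import numpy as np

def region_color_histograms(keyframe, grid=(3, 3), bins_per_channel=4):
    """Split an RGB keyframe into a grid of M = rows*cols regions and
    return a stacked vector z of length M*C, where C = bins_per_channel**3.
    Grid size and bin count are illustrative, not the paper's settings."""
    H, W, _ = keyframe.shape
    rows, cols = grid
    feats = []
    for r in range(rows):
        for c in range(cols):
            region = keyframe[r*H//rows:(r+1)*H//rows, c*W//cols:(c+1)*W//cols]
            # quantize each channel into bins_per_channel levels
            q = (region.astype(np.int64) * bins_per_channel) // 256
            idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
            hist = np.bincount(idx.ravel(), minlength=bins_per_channel**3)
            feats.append(hist / hist.sum())   # normalize each region's histogram
    return np.concatenate(feats)              # stacked vector of length M*C

# toy 90x120 RGB keyframe; M = 9 regions, C = 64 bins each
z = region_color_histograms(np.random.randint(0, 256, (90, 120, 3), dtype=np.uint8))
```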

3.2 Family-of-harmonium (FoH) The FoH model is illustrated in Figure 2. It contains a set of T category-specific harmoniums, with each harmonium modeling the video data from a specific category. Each harmonium is a bipartite undirected graphical model consisting of two layers of nodes. Nodes in the top layer represent the latent semantic topics H = {H_k} of the data. To represent the bi-modal features of video data, the bottom layer contains two "wings" of observed nodes that represent the keyword features X = {X_i} and the region-based color features Z = {Z_j}, respectively. Each node is linked with all the nodes in the opposite layer, but not with any node in the same layer. This topology ensures that the nodes in one layer are conditionally independent given the nodes in the opposite layer, a property important to the construction and inference of the model. All the component harmoniums in FoH share exactly the same structure, but each has a unique set of parameters θ^y = (π^y, α^y, β^y, W^y, U^y) indexed by the category label y.

We now describe the distributions of these variables. The category label Y follows a multinomial prior distribution:

(3.1) p(y) = Multi(π_1, ..., π_T),

where ∑_{t=1}^{T} π_t = 1. In FoH, Y is not actually linked with any nodes in the component harmoniums; instead, it serves as an indicator variable that selects a specific harmonium from the whole family to model the video data of a particular category. In the distribution function of each harmonium, Y appears only as an index of the model parameters.

Given its category label y, we view the raw features of a shot and its latent semantic topics as two layers of representations mutually influencing each other in the specific harmonium associated with this category. We can either conceive of the keyword and color features as being generated by the latent semantic topics, or conceive of the semantic topics as being summarized from the keyword and image features. This mutual influence is reflected in the conditional distributions of the variables representing the features and the semantic topics.

For the keyword feature, the variable x_i indicating the presence/absence of term i ∈ {1, ..., N} of the vocabulary is distributed as:

(3.2) P(X_i = 1|h, y) = 1 / (1 + exp(−α^y_i − ∑_k W^y_{ik} h_k)),
      P(X_i = 0|h, y) = 1 − P(X_i = 1|h, y).

This shows that each keyword in a video shot is sampled from a Bernoulli distribution dependent on the latent semantic topics h. That is, the probability that a keyword appears is affected by a weighted combination of the semantic topics h. The parameters α^y_i and W^y_{ik} are both scalars, so α^y = (α^y_1, ..., α^y_N) is an N-dimensional vector, and W^y = [W^y_{ik}] is a matrix of size N × K. Due to the conditional independence of the x_i given h, we have p(x|h, y) = ∏_i p(x_i|h, y).
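Eq.(3.2) is simply a per-word logistic (sigmoid) function of the topic vector. A small numpy sketch with toy sizes and random parameters (all values here are illustrative, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3                      # toy vocabulary size and number of topics
alpha = rng.normal(size=N)       # alpha^y: per-word bias
W = rng.normal(size=(N, K))      # W^y: word-topic coupling matrix
h = rng.normal(size=K)           # latent topic strengths

# Eq.(3.2): P(X_i = 1 | h, y) = sigmoid(alpha_i + sum_k W_ik h_k)
p_word = 1.0 / (1.0 + np.exp(-(alpha + W @ h)))

# conditional independence: p(x|h,y) factorizes over words
x = np.array([1, 0, 1, 1, 0])
log_p_x = np.sum(x * np.log(p_word) + (1 - x) * np.log(1 - p_word))
```

Each entry of `p_word` lies strictly between 0 and 1, and the factorized log-probability of any keyword vector is a sum of independent Bernoulli terms.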

The color-histogram feature z_j of the jth region in the keyframe of the shot admits a conditional multivariate Gaussian distribution:

(3.3) p(z_j|h, y) = N(z_j | Σ^y_j (β^y_j + ∑_k U^y_{jk} h_k), Σ^y_j)

where z_j is sampled from a distribution parameterized by the latent semantic topics h. Here, both β^y_j and U^y_{jk} are C-dimensional vectors, and therefore β^y = (β^y_1, ..., β^y_M) is a stacked vector of dimension CM and U^y = [U^y_{jk}] is a matrix of size CM × K. Note that Σ^y_j is a C × C covariance matrix, which, for simplicity, is set to the identity matrix I in our model. Again, we have p(z|h, y) = ∏_j p(z_j|h, y) due to conditional independence.

Finally, each latent topic variable h_k follows a conditional univariate Gaussian distribution whose mean is determined by a weighted combination of the keyword feature x and the color feature z:

(3.4) p(h_k|x, z, y) = N(h_k | ∑_i W^y_{ik} x_i + ∑_j U^y_{jk} z_j, 1)

where W^y_{ik} and U^y_{jk} are the same parameters used in Eq.(3.2) and Eq.(3.3). Similarly, p(h|x, z, y) = ∏_k p(h_k|x, z, y) holds.

So far we have presented the conditional distributions of all the variables in the model. These local conditionals can be mapped to the following harmonium random field:

(3.5) p(x, z, h|y) ∝ exp{ ∑_i α^y_i x_i + ∑_j β^y_j z_j − ∑_j z_j^2/2 − ∑_k h_k^2/2 + ∑_{ik} W^y_{ik} x_i h_k + ∑_{jk} U^y_{jk} z_j h_k }

We present the details of the derivation of this random field in the Appendix. Note that the partition function (global normalization term) of this distribution is not explicitly shown, so we use a proportionality sign instead of an equality sign. This hidden partition function increases the difficulty of learning the model.
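Because the two layers are conditionally independent given each other, the local conditionals (3.2)-(3.4) support simple block-Gibbs alternation between the feature layer and the topic layer. A sketch with toy dimensions, taking C = 1 so each region feature is a scalar (purely for brevity; parameters are random, not learned):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 4, 3, 2                              # words, regions, topics (toy sizes)
alpha, beta = rng.normal(size=N), rng.normal(size=M)
W = rng.normal(size=(N, K)) * 0.1              # small couplings keep the chain stable
U = rng.normal(size=(M, K)) * 0.1

x = rng.integers(0, 2, size=N).astype(float)   # initial keyword indicators
z = rng.normal(size=M)                         # initial (scalar) region features
for _ in range(10):
    # Eq.(3.4): h_k ~ N(sum_i W_ik x_i + sum_j U_jk z_j, 1)
    h = (W.T @ x + U.T @ z) + rng.normal(size=K)
    # Eq.(3.2): x_i ~ Bernoulli(sigmoid(alpha_i + sum_k W_ik h_k))
    x = (rng.random(N) < 1.0 / (1.0 + np.exp(-(alpha + W @ h)))).astype(float)
    # Eq.(3.3) with Sigma = I: z_j ~ N(beta_j + sum_k U_jk h_k, 1)
    z = (beta + U @ h) + rng.normal(size=M)
```

Each sweep samples an entire layer at once, which is exactly the property the bipartite topology buys.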

By integrating out the hidden variables h in Eq.(3.5), we obtain the category-conditional distribution over the observed keyword and color features of a video shot:

(3.6) p(x, z|y) ∝ exp{ ∑_i α^y_i x_i + ∑_j β^y_j z_j − ∑_j z_j^2/2 + (1/2) ∑_k (∑_i W^y_{ik} x_i + ∑_j U^y_{jk} z_j)^2 }.

There is also a hidden partition function in this distribution. The marginal distribution (likelihood) of a labeled video shot can be decomposed into a category-specific marginal and a prior over the categories, i.e., p(x, z, y) = p(x, z|y)p(y).

Figure 3: Hierarchical harmonium model

Learning FoH involves learning T component harmoniums, with each harmonium learned independently using the (labeled) video shots from the corresponding category. To learn the harmonium model for a category y, we estimate its model parameters θ^y = (α^y, β^y, W^y, U^y) by maximizing the likelihood of the video shots in category y, where the likelihood function is defined by Eq.(3.6). Due to the existence of the partition function, the learning requires approximate inference methods. We discuss the learning methods further in Section 4.

The category of an unlabeled shot is predicted by finding the component harmonium that best describes the features of the shot. Given the keyword feature x and color feature z of a shot, we compute the posterior probability of each category label as:

(3.7) p(y|x, z) ∝ p(x, z|y)p(y) ∝ p(x, z|y)

The second step in the derivation assumes that the category prior is uniform, p(y) = 1/T. Eq.(3.7) indicates that we can predict the category of a shot by comparing its likelihood p(x, z|y) under each of the category-specific harmonium models, computed by Eq.(3.6). The harmonium that best fits the shot determines its category.
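The scoring rule of Eq.(3.7) can be sketched as follows, using the unnormalized exponent of Eq.(3.6) as the per-category score (toy sizes, C = 1, random parameters; a real implementation must also account for each harmonium's partition function, which differs across categories):

```python
import numpy as np

def foh_log_score(x, z, alpha, beta, W, U):
    """Unnormalized exponent of Eq.(3.6) for one category-specific
    harmonium (C = 1 toy case; the partition function is omitted)."""
    proj = W.T @ x + U.T @ z                   # sum_i W_ik x_i + sum_j U_jk z_j
    return (alpha @ x + beta @ z - 0.5 * np.sum(z ** 2)
            + 0.5 * np.sum(proj ** 2))

rng = np.random.default_rng(2)
N, M, K, T = 4, 3, 2, 3                        # toy sizes, T categories
params = [(rng.normal(size=N), rng.normal(size=M),
           rng.normal(size=(N, K)), rng.normal(size=(M, K))) for _ in range(T)]

x = rng.integers(0, 2, N).astype(float)
z = rng.normal(size=M)
y_hat = int(np.argmax([foh_log_score(x, z, *p) for p in params]))
```

The predicted label is simply the index of the harmonium with the highest score, mirroring "the harmonium that best fits the shot determines its category."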

3.3 Hierarchical harmonium (HH) The second proposed model, hierarchical harmonium, adopts a different way of incorporating category labels into the basic harmonium model. Instead of building a separate harmonium for each category, it introduces the category labels as another layer of nodes Y = {Y_1, ..., Y_T} in a single harmonium, with Y_t ∈ {0, 1} indicating a shot's membership in category t. As illustrated in Figure 3, these label variables Y form a bipartite subgraph with the latent topic nodes H. There is a link between every Y_t and H_k, but none between two label nodes, which are conditionally independent given H. Unlike FoH, there is only a single hierarchical harmonium in this model.


In the HH model, the conditional distributions of x and z stay the same as those in the FoH model, defined by Eq.(3.2) and Eq.(3.3), respectively. The only difference is that the model parameters θ = (α, β, τ, W, U, V) no longer depend on category labels. The introduced label variable Y_t follows a Bernoulli distribution:

(3.8) P(Y_t = 1|h) = 1 / (1 + exp(−τ_t − ∑_k V_{tk} h_k)),
      P(Y_t = 0|h) = 1 − P(Y_t = 1|h),

where V = [V_{tk}] is a matrix of size T × K. Note that if we treat h as input and V_{tk} and τ_t as parameters, this distribution has exactly the same form as the posterior distribution of the class label in logistic regression [4], i.e., P(Y = 1|x) = 1/(1 + exp(−β_0 − β^T x)). This implies that the model is effectively performing logistic regression to compute each category label Y_t using the latent semantic topics h as input.

The distribution of each latent topic variable h_k needs to be modified to incorporate the interactions between the label variables y and the topic variables h:

(3.9) p(h_k|x, z, y) = N(h_k | ∑_i W_{ik} x_i + ∑_j U_{jk} z_j + ∑_t V_{tk} y_t, 1)

Therefore, the distribution of the latent semantic topics is affected not only by the data features x and z, but also by their labels y. This is a significant difference from existing harmonium models [14, 17], where the distribution of latent topics depends only on the data.

With the incorporation of label variables, the random field of the hierarchical harmonium becomes:

(3.10) p(x, z, h, y) ∝ exp{ ∑_i α_i x_i + ∑_j β_j z_j − ∑_j z_j^2/2 + ∑_t τ_t y_t − ∑_k h_k^2/2 + ∑_{ik} W_{ik} x_i h_k + ∑_{jk} U_{jk} z_j h_k + ∑_{tk} V_{tk} y_t h_k }

After integrating out the hidden variables h, the marginal distribution of a labeled video shot (x, z, y) is:

(3.11) p(x, z, y) ∝ exp{ ∑_i α_i x_i + ∑_j β_j z_j − ∑_j z_j^2/2 + ∑_t τ_t y_t + (1/2) ∑_k (∑_i W_{ik} x_i + ∑_j U_{jk} z_j + ∑_t V_{tk} y_t)^2 }.

The parameters of the HH model, θ = (α, β, τ, W, U, V), are estimated under the maximum-likelihood principle using the likelihood function defined by Eq.(3.11). Classification is performed in a very different way in HH. To predict the category of an unlabeled video shot, we need to infer the unknown label variables Y of the shot from its keyword and color features. This is done by computing the conditional probability p(Y_t = 1|x, z) for each label variable Y_t. The category that gives the highest conditional probability is predicted as the category of the shot:

(3.12) t* = argmax_t p(Y_t = 1|x, z)

There is, however, no analytical solution for this conditional probability. Various approximate inference methods are available to compute it, as further discussed in Section 4.

3.4 Model comparison We compare our models with existing models for text and multimedia data, including pLSI [5], LDA [2] and its variants GM-LDA and Corr-LDA [1], and the exponential-family harmonium [14, 17]. First of all, our models not only derive the latent semantic representation of the data but also perform classification within the same framework, while all the above models are only for data representation and require separate classifiers for the classification task. This is not necessarily an advantage of our models, but in terms of classification, ours is a more integrated approach, which presumably leads to superior performance and better data interpretation. A model similar (in spirit) to our FoH model is the Bayesian hierarchical model for scene classification proposed by Fei-Fei et al. [8], except that it is based on the directed LDA model. Second, in our models the category labels "supervise" the derivation of the latent semantic representation. As a result, the derived representation reflects not only the characteristics of the underlying data but also the category information. This contrasts with the "unsupervised" derivation of the latent semantic representation in all the other models. The third issue is the choice between directed and undirected models. The harmonium models [14, 17], including ours, are all undirected, while the rest are directed. Using undirected models makes inference easier due to the conditional independence of the hidden variables, but learning is usually harder due to the global normalization term.

There are also interesting contrasts between the proposed models. They first differ in the semantics of the latent topics derived. In FoH, each harmonium model is for a specific category, and the latent topics learned in each harmonium capture the internal structure of the data in that category, i.e., they represent the themes or data sub-clusters of that particular category. There are no correspondences between the semantic topics across different harmoniums: the first topic in one harmonium is unrelated to the first topic in another. In contrast, HH has a single set of latent semantic topics derived from the data of all categories. These semantic topics are, however, different from those learned by other representation models, as they are "supervised" by the category labels and presumably contain more discriminative information. Sharing a single semantic representation also helps reveal the connections and differences between multiple categories. The two models also differ in terms of scalability. FoH can easily accommodate a new category by adding another harmonium trained on the data of the new category; the existing harmoniums do not need to be changed. However, introducing a new category into HH means adding another (label) node to the model, which requires retraining the whole model since its structure has changed.

4 Learning and inference

The parameters of our models, namely (α^y, β^y, W^y, U^y) in the FoH model and (α, β, τ, W, U, V) in the HH model, can be estimated by maximizing the data likelihood. However, there is no closed-form solution for the parameters in complex models like ours, so an iterative search algorithm has to be used. As an example, we discuss the learning and inference algorithms for the HH model; those for each component harmonium in the FoH model can be easily derived from them.

As described in the previous section, the log-likelihood of the data under the HH model is defined by Eq.(3.11). Taking derivatives of the log-likelihood function w.r.t. the parameters yields the following gradient learning rules:

$$
\begin{aligned}
\delta\alpha_i &= \langle x_i\rangle_{\tilde p} - \langle x_i\rangle_p, &
\delta\beta_j &= \langle z_j\rangle_{\tilde p} - \langle z_j\rangle_p, &
\delta\tau_t &= \langle y_t\rangle_{\tilde p} - \langle y_t\rangle_p \\
\delta W_{ik} &= \langle x_i h'_k\rangle_{\tilde p} - \langle x_i h'_k\rangle_p, &
\delta U_{jk} &= \langle z_j h'_k\rangle_{\tilde p} - \langle z_j h'_k\rangle_p, &
\delta V_{tk} &= \langle y_t h'_k\rangle_{\tilde p} - \langle y_t h'_k\rangle_p
\end{aligned}
\tag{4.13}
$$

where $h'_k = \sum_i W_{ik} x_i + \sum_j U_{jk} z_j + \sum_t V_{tk} y_t$, and $\langle\cdot\rangle_{\tilde p}$ and $\langle\cdot\rangle_p$ denote expectations under the empirical distribution (i.e., the data average) and the model distribution of the harmonium, respectively. Like most undirected graphical models, the likelihood function of a harmonium contains a global normalization term, which makes directly computing $\langle\cdot\rangle_p$ intractable. Therefore, we need approximate inference methods to approximate the model expectations $\langle\cdot\rangle_p$. We explored four methods, which are briefly discussed below. The conditional distribution of the label nodes $p(Y_t = 1|\mathbf{x}, \mathbf{z})$ is also computed using these approximate inference methods.
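As a concrete illustration of these learning rules, the sketch below applies one gradient step of the Eq.(4.13) form. All shapes, values, and the learning rate are hypothetical; in the real algorithm the empirical expectations come from data averages and the model expectations from one of the approximate inference methods below.

```python
import numpy as np

# Hypothetical sizes: I word features, J color features, T labels, K topics.
I, J, T, K = 6, 4, 3, 2

# HH-model parameters, named as in the paper; zero-initialized for illustration.
params = {
    "alpha": np.zeros(I), "beta": np.zeros(J), "tau": np.zeros(T),
    "W": np.zeros((I, K)), "U": np.zeros((J, K)), "V": np.zeros((T, K)),
}

def gradient_step(params, emp, mod, lr=0.1):
    """One step of the Eq.(4.13) rules: each parameter moves along the
    difference between the empirical expectation (emp) and the model
    expectation (mod) of its sufficient statistic (e.g. <x_i h'_k> for W)."""
    return {name: val + lr * (emp[name] - mod[name])
            for name, val in params.items()}

# Toy expectations: pretend every data average exceeds the model's by 1.
emp = {name: np.ones_like(val) for name, val in params.items()}
mod = {name: np.zeros_like(val) for name, val in params.items()}
new_params = gradient_step(params, emp, mod)
```

The update is a plain gradient ascent on the log-likelihood; the whole difficulty lies in estimating `mod`, which is what Sections 4.1–4.4 address.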

4.1 Mean field approximation Mean field (MF) is a variational method that approximates the model distribution p by a factorized form, a product of marginals over clusters of variables [16]. We use the naive version of mean field, where the joint probability p is approximated by a surrogate distribution q that is a product of singleton marginals over the variables:

$$
q(\mathbf{x}, \mathbf{z}, \mathbf{y}, \mathbf{h}) = \prod_i q(x_i|\nu_i) \prod_j q(z_j|\mu_j) \prod_t q(y_t|\lambda_t) \prod_k q(h_k|\gamma_k)
$$

where the singleton marginals are defined as $q(x_i) \sim \mathrm{Bernoulli}(\nu_i)$, $q(z_j) \sim \mathcal{N}(\mu_j, 1)$, $q(y_t) \sim \mathrm{Bernoulli}(\lambda_t)$, and $q(h_k) \sim \mathcal{N}(\gamma_k, 1)$, and $\{\nu_i, \mu_j, \lambda_t, \gamma_k\}$ are variational parameters. The variational parameters can be computed by minimizing the KL-divergence between p and q, which results in the following fixed-point update equations:

$$
\begin{aligned}
\nu_i &= \sigma\Big(\alpha_i + \sum_k W_{ik}\gamma_k\Big)\\
\mu_j &= \beta_j + \sum_k U_{jk}\gamma_k\\
\lambda_t &= \sigma\Big(\tau_t + \sum_k V_{tk}\gamma_k\Big)\\
\gamma_k &= \sum_i W_{ik}\nu_i + \sum_j U_{jk}\mu_j + \sum_t V_{tk}\lambda_t
\end{aligned}
$$

where σ(x) = 1/(1 + exp(−x)) is the sigmoid function. After the fixed-point equations converge, the surrogate distribution q is fully specified by the converged variational parameters. We replace the intractable 〈·〉_p with 〈·〉_q in Eq.(4.13), which is easy to compute from the fully factorized q. Note that after each iterative search step of Eq.(4.13), we need to recompute the variational parameters of q, since the model parameters of p have been updated.
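The fixed-point iteration above can be sketched as follows. The function signature, sizes, and zero-initialization are illustrative; the update order and formulas follow the equations in this subsection.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(alpha, beta, tau, W, U, V, n_iter=50):
    """Naive mean-field fixed-point iteration for the HH model (sketch).
    Returns the converged variational parameters (nu, mu, lam, gamma):
    Bernoulli means for word and label nodes, Gaussian means for color
    and topic nodes."""
    K = W.shape[1]
    gamma = np.zeros(K)                         # topic means, arbitrary start
    for _ in range(n_iter):
        nu = sigmoid(alpha + W @ gamma)         # word-node means
        mu = beta + U @ gamma                   # color-node means
        lam = sigmoid(tau + V @ gamma)          # label-node means
        gamma = W.T @ nu + U.T @ mu + V.T @ lam # topic means
    return nu, mu, lam, gamma

# Toy run with all-zero parameters (I=4 words, J=3 colors, T=2 labels, K=2 topics).
nu, mu, lam, gamma = mean_field(np.zeros(4), np.zeros(3), np.zeros(2),
                                np.zeros((4, 2)), np.zeros((3, 2)), np.zeros((2, 2)))
```

With zero parameters the Bernoulli means settle at 0.5 and the Gaussian means at 0, as the sigmoid and linear updates dictate.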

4.2 Gibbs sampling Gibbs sampling, as a special form of the Markov chain Monte Carlo (MCMC) method, has been widely used for approximate inference in complex graphical models [7]. This method repeatedly samples variables in a particular order, one variable at a time, conditioned on the current values of the other variables. For example, in our hierarchical harmonium model we define the sampling order as y_1, ..., y_T, h_1, ..., h_K: we sample each y_t from the conditional distribution defined in Eq.(3.8) using the current values of h_k, and then sample each h_k according to Eq.(3.9). After a large number of iterations (the "burn-in" period), this procedure is guaranteed to reach an equilibrium distribution that in theory equals the model distribution p. Therefore, we use the empirical expectation computed from the Gibbs samples collected after the burn-in period to approximate the true expectation 〈·〉_p.

4.3 Contrastive divergence Instead of performing an exact gradient ascent using the learning rules in Eq.(4.13), we can use the contrastive divergence (CD) method of Hinton and Welling [13] to approximate them. In each gradient update, instead of computing the model expectation 〈·〉_p, CD runs Gibbs sampling for only a few iterations and uses the resulting distribution q to approximate the model distribution p. It has been proved that the parameter values produced by this kind of update converge to the maximum likelihood estimate [13]. In our implementation, we compute 〈·〉_q from a large number of samples obtained by running only one step of Gibbs sampling from different initializations. CD is significantly more efficient than the Gibbs sampling method since the "burn-in" process is skipped.
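The one-step idea can be sketched on a plain binary harmonium (RBM form), standing in for the full HH update: the intractable model statistic is replaced by statistics from a single Gibbs reconstruction. Names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_gradient(X, W, a, b):
    """CD-1 estimate of the W-gradient for a binary harmonium:
    <x h>_data minus <x h> after one Gibbs reconstruction step.
    X: (n, I) batch of binary visible vectors; W: (I, K); a, b: biases."""
    n = X.shape[0]
    ph = sigmoid(b + X @ W)                        # p(h=1 | x) on the data
    h = (rng.random(ph.shape) < ph).astype(float)  # sampled hidden states
    pv = sigmoid(a + h @ W.T)                      # one-step reconstruction
    ph2 = sigmoid(b + pv @ W)                      # hidden probs of reconstruction
    return (X.T @ ph - pv.T @ ph2) / n             # positive minus negative phase

# Toy batch: n=8 shots, I=6 visible units, K=3 hidden topics.
X = (rng.random((8, 6)) < 0.5).astype(float)
dW = cd1_gradient(X, W=np.zeros((6, 3)), a=np.zeros(6), b=np.zeros(3))
```

Because the reconstruction starts from the data rather than from a long chain, the burn-in period is avoided entirely, which is the efficiency gain described above.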

4.4 The uncorrected Langevin method The uncorrected Langevin method [9] originates from the Langevin Monte Carlo method by accepting all proposal moves. It makes use of gradient information and resembles noisy steepest ascent, which helps avoid local optima. Similar to gradient ascent, the uncorrected Langevin algorithm has the following update rule:

$$
\lambda_{ij}^{\mathrm{new}} = \lambda_{ij} + \frac{\varepsilon^2}{2} \frac{\partial}{\partial \lambda_{ij}} \log p(X, \lambda) + \varepsilon n_{ij}
\tag{4.14}
$$

where $n_{ij} \sim \mathcal{N}(0, 1)$ and $\varepsilon$ is a parameter controlling the step size. Like the contrastive divergence algorithm, we use only a few iterations of Gibbs sampling to approximate the model distribution p.
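Eq.(4.14) can be written directly as code; the gradient argument is a placeholder for whatever estimate of ∂ log p/∂λ is available (e.g. from the short Gibbs runs just mentioned).

```python
import numpy as np

rng = np.random.default_rng(3)

def langevin_step(lam, grad_log_p, eps=0.01):
    """Uncorrected Langevin update of Eq.(4.14): a gradient-ascent step of
    size eps^2/2 plus Gaussian noise of scale eps, with every proposal
    accepted (no Metropolis correction)."""
    noise = rng.standard_normal(lam.shape)
    return lam + 0.5 * eps**2 * grad_log_p + eps * noise

# Toy usage on a 4-dimensional parameter vector.
lam_new = langevin_step(np.ones(4), grad_log_p=np.zeros(4), eps=0.01)
```

Setting ε = 0 recovers a no-op, and small ε makes the update a barely perturbed gradient step, which is why the method behaves like noisy steepest ascent.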

5 Experiments

We evaluate the proposed models on video data from the TRECVID 2003 development set [11]. Based on the manual annotations of this set, we choose 2468 shots belonging to 15 semantic categories: airplane, animal, baseball, basketball, beach, desert, fire, football, hockey, mountain, office, road traffic, skating, studio, and weather news. Each shot belongs to only one category, and the size of a category varies from 46 to 373 shots. The keywords of each shot are extracted from the video closed-captions associated with that shot. By removing non-informative words such as stop words and rare words, we reduce the total number of distinct keywords (vocabulary size) to 3000. Meanwhile, we evenly divide the key-frame of each shot into a grid of 5x5 regions and extract a 15-dimensional color histogram in HVC color space from each region. Each video shot is therefore represented by a 3000-d keyword feature and a 375-d color-histogram feature. For simplicity, the keyword features are made binary, capturing only the presence/absence of each keyword, because a keyword rarely appears multiple times within the short duration of a shot.

Figure 4: The representative images and keywords of 5 latent topics derived from the data in category "Fire"
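The per-shot representation described above (binary keyword vector plus a 5x5 grid of 15-bin color histograms) can be sketched as follows. The function name, the toy vocabulary, and the pre-quantized `keyframe` input are all hypothetical; the actual HVC color quantization is omitted.

```python
import numpy as np

def shot_features(keyframe, transcript_words, vocab):
    """Build the per-shot features: a binary keyword vector over `vocab`
    and a concatenated color feature from a 5x5 grid of regions, each
    summarized by a normalized 15-bin histogram. `keyframe` is an (H, W)
    integer array whose values are already quantized into 15 color bins."""
    x = np.array([1.0 if w in transcript_words else 0.0 for w in vocab])
    H, W = keyframe.shape
    hists = []
    for r in range(5):
        for c in range(5):
            region = keyframe[r*H//5:(r+1)*H//5, c*W//5:(c+1)*W//5]
            hist = np.bincount(region.ravel(), minlength=15)[:15]
            hists.append(hist / max(hist.sum(), 1))  # normalized histogram
    z = np.concatenate(hists)                        # 25 regions x 15 bins = 375-d
    return x, z

# Toy shot: a 50x50 quantized key-frame and a 3-word vocabulary.
rng = np.random.default_rng(4)
frame = rng.integers(0, 15, size=(50, 50))
x, z = shot_features(frame, {"fire", "smoke"}, ["fire", "night", "explosion"])
```

With the full setup the vocabulary has 3000 entries, so `x` is 3000-d and `z` is 375-d, matching the dimensions quoted above.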

The experimental results are presented in two parts. In the first part, we show illustrative examples of the latent semantic topics derived by the proposed models and discuss the insights they provide into the structure and relationships of video categories. In the second part, we evaluate the performance of our models on video classification in comparison with some existing approaches.

5.1 Interpretation of latent semantic topics Both the family-of-harmonium (FoH) and the hierarchical harmonium (HH) model derive latent semantic topics as an intermediate representation of video data. Since each harmonium in FoH is learned independently from the data of a specific category, its latent topics capture the structure of that particular category. To show that these topics are meaningful, in Figure 4 we illustrate the 5 latent topics learned from the video category "fire" by showing the keywords and images of the 5 video shots that have the highest conditional probability given each latent topic. As we can see, the 5 topics roughly correspond to 5 sub-categories under the category "fire", which can be described as "forest fire in the night", "explosion in outer space", "launch of missile or space shuttle", "smoke of fire", and "close-up scene of fire". Since these latent topics are derived by jointly modeling the textual and image features of the video data, they are more than clusters in the color or keyword feature space; they are "co-clusters" in both feature spaces. For example, the shots of Topic 1 are visually very similar to each other; the shots of Topic 2 are not so similar visually, but they clearly have very close semantic meanings and share common keywords such as "flight" and "radar". The keywords associated with Topic 5 seem irrelevant at first glance, but we later found that these shots contain scenes from a movie, which explains the occurrence of keywords like "love", "freedom", and "beautiful".

Figure 5: The representative images and keywords of 5 latent topics derived from the whole data set

We also illustrate 5 latent topics out of the set of 20 topics learned by the HH model in Figure 5. Note that these topics are learned from the whole data set instead of the data of one category, so they are expected to represent high-level semantic topics. We can see that these 5 topics are about "studio", "baseball or football", "weather news", "airplane or skating", and "animal", which can be roughly mapped to some of the 15 categories in the data set. These results clearly show that the latent semantic topics learned by our models capture the semantics of the video data.

Figure 6: The color-coded matrix showing the pairwise similarity between categories. Best viewed in color. Categories: 1 fire, 2 football, 3 mountain, 4 studio, 5 airplane, 6 animal, 7 baseball, 8 basketball, 9 beach, 10 desert, 11 hockey, 12 skating, 13 office, 14 traffic, 15 weather.

Another advantage of the hierarchical harmonium, as discussed in Section 3.4, is that it reveals the relationships between different categories through the hidden topics. We can tell how much a category t is associated with a latent topic j from the conditional probability p(y_t|h_j). Therefore, we are able to compute the similarity between any two categories by examining the hidden topics they are associated with. We show the pairwise similarity between the 15 categories using the color-coded similarity matrix in Figure 6, where redder colors denote higher similarity and bluer colors denote lower similarity. We can see many meaningful pairs of related categories: e.g., "mountain" is strongly related to "animal" and "baseball" is related to "hockey", while "studio" is not related to any category. These relationships are largely consistent with common sense.
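One simple way to turn the label-topic associations into the kind of pairwise matrix shown in Figure 6 is cosine similarity between the categories' topic-association vectors. The paper does not specify its exact similarity measure, so this is a hedged sketch: `A` is a hypothetical (T, K) matrix whose row t holds the association of category t with each of the K topics.

```python
import numpy as np

def category_similarity(A):
    """Pairwise category similarity from label-topic associations.
    A: (T, K) matrix, row t = association of category t with each topic
    (e.g. values derived from p(y_t|h_k)). Returns the (T, T) cosine
    similarity matrix: categories sharing topics score high."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    An = A / np.maximum(norms, 1e-12)   # row-normalize, guarding zero rows
    return An @ An.T

# Toy example: categories 0 and 1 share topic 0; category 2 prefers topic 1.
A = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9]])
S = category_similarity(A)
```

In this toy setup the first two categories come out far more similar to each other than either is to the third, mirroring how related categories light up in the matrix.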

5.2 Performance on video classification To evaluate the performance of the FoH and HH models on video classification, we evenly divide our data set into a training set and a test set. The model parameters are estimated from the training set. Specifically, we implemented the learning methods based on the four inference algorithms described in Section 4, in order to examine their efficiency and accuracy. We also explore the issue of model selection, namely the impact of the number of latent semantic topics on the classification performance.

Several other methods have been implemented for comparison, all of which produce some form of intermediate representation of the video data. First, we implemented the approach used in [17], which learns a dual-wing harmonium (DWH) from the data and then builds an SVM classifier on the latent semantic representations generated by DWH. This method differs from our approach in that it uses the harmonium model only for data representation and leaves the classification task to a discriminative classifier, while our approach integrates representation and classification. We also implemented three directed graphical models for representing video data: the Gaussian-multinomial mixture model (GM-Mixture), Gaussian-multinomial latent Dirichlet allocation (GM-LDA), and correspondence latent Dirichlet allocation (Corr-LDA). The details of these models can be found in [1]. Like DWH, all three directed models are used only for data representation, and each of them requires an SVM classifier for classification. To make the experiments with various models, learning algorithms, and numbers of latent topics tractable, we restrict this part of the experiments to the 5 largest categories, containing 1078 shots in total: airplane, basketball, baseball, hockey, and weather news.

Figure 7: Classification performance of different models

Figure 7 shows the classification accuracies of the proposed FoH and HH models as well as the comparison methods DWH, GM-Mixture, GM-LDA, and Corr-LDA. For a fair comparison, all the models are implemented using the mean-field variational method (MF) for learning and inference, except GM-Mixture, which is implemented using the expectation-maximization (EM) method. All the approaches are evaluated with the number of latent semantic topics set to 5, 10, 20, and 50, in order to study the relationship between performance and model complexity.

Several interesting observations can be drawn from Figure 7. First, the three undirected models, FoH, HH, and DWH, achieve significantly higher performance than the directed models GM-Mixture, GM-LDA, and Corr-LDA, showing that the harmonium model is an effective tool for video representation and classification. Among them, FoH is the best performer at 5 and 10 latent semantic nodes, while DWH is the best performer at 20 and 50 latent nodes with HH as the close runner-up. Second, we find that the performance of FoH and HH is overall at the same level as DWH. Given that DWH uses an SVM classifier, this result is encouraging, as it shows that our approach is comparable in performance to a state-of-the-art discriminative classifier. On the other hand, our approach enjoys advantages that an SVM does not have; for example, FoH can easily accommodate a new category without re-training the whole model. Third, the performance of DWH and HH improves as the number of latent topics increases, which is intuitive because using more latent topics leads to a better representation of the data. However, this trend is reversed for FoH, which performs much better with a smaller number of latent topics. While a theoretical explanation is still unclear, in practice this is a good property of FoH, since it achieves high performance with simpler models. Fourth, 20 seems to be a reasonable number of latent semantic topics for this data set, since further increasing the number of topics does not yield considerable improvement.

Figure 8: Classification performance of different approximate inference methods in hierarchical harmonium

Figure 8 shows the classification accuracies of the HH model implemented with different approximate inference methods. From the graph, we can see that the Langevin and contrastive divergence (CD) methods have about the same performance, which is slightly higher than that of mean-field (MF) and Gibbs sampling. We also study the efficiency of these inference methods by examining the time they need to reach convergence in training. The results show that mean field is the most efficient (approx. 2 min), followed by CD and Langevin (approx. 9 min), with Gibbs sampling the slowest (approx. 49 min). Therefore, Langevin and CD are good choices for the learning and inference of our models in terms of both efficiency and classification performance.

6 Conclusion

We have described two bipartite undirected models for semantic representation and classification of video data. The two models derive latent semantic representations of video data by jointly modeling its textual and image features, and perform classification based on these latent representations. Experiments on TRECVID data demonstrate that our models achieve satisfactory performance on video classification and provide insights into the internal structure and relationships of video categories. Approximate inference methods for the learning and application of our models have been discussed and compared.

Our hierarchical harmonium by nature does not restrict the number of categories an instance (shot) belongs to, since P(Y_t = 1|x, z) can be high for multiple Y_t. An interesting direction for future work is therefore to evaluate the model on a multi-label data set, where each instance can belong to any number of categories. In this case, our method becomes a multi-task learning (MTL) method and should be compared with other MTL approaches. Our models can also be improved by using better low-level features as input. The region-based color-histogram features are quite sensitive to scale and illumination variations; features such as local keypoints are more robust and can be easily integrated into our models. It would be interesting to compare the latent semantic interpretations and classification performance obtained with different features.

References

[1] D. M. Blei and M. I. Jordan. Modeling annotated data. In SIGIR '03: Proceedings of the 26th Annual Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 127–134, New York, NY, USA, 2003. ACM Press.

[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[3] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W.Furnas, and R. A. Harshman. Indexing by latentsemantic analysis. Journal of the American Society ofInformation Science, 41(6):391–407, 1990.

[4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

[5] T. Hofmann. Probabilistic latent semantic analysis. InProc. of Uncertainty in Artificial Intelligence, UAI’99,Stockholm, 1999.

[6] G. Iyengar and H. J. Nock. Discriminative modelfusion for semantic concept detection and annotationin video. In Proc. of the 11th ACM Int’l Conf. onMultimedia, pages 255–258, New York, NY, USA, 2003.ACM Press.

[7] M. I. Jordan. Learning in Graphical Models: Founda-tions of Neural Computation. The MIT press, 1998.

[8] F.-F. Li and P. Perona. A bayesian hierarchical modelfor learning natural scene categories. In Proc. ofthe 2005 IEEE Computer Society Conf. on ComputerVision and Pattern Recognition (CVPR), pages 524–531, Washington, DC, USA, 2005. IEEE ComputerSociety.

[9] I. Murray and Z. Ghahramani. Bayesian learningin undirected graphical models: Approximate mcmcalgorithms. In M. Chickering and J. Halpern, editors,Proceedings of the 20th Annual Conf. on Uncertainty inArtificial Intelligence (UAI-04), pages 392–399. AUAIpress, 2004.

[10] Y. Rui, R. Jain, N. D. Georganas, H. Zhang, K. Nahrst-edt, J. Smith, and M. Kankanhalli. What is the stateof our community? In Proc. of the 13th annual ACMInt’l Conf. on Multimedia, pages 666–668, New York,NY, USA, 2005. ACM Press.

[11] A. Smeaton and P. Over. TRECVID: Benchmarking the effectiveness of information retrieval tasks on digital video. In Proc. of the Int'l Conf. on Image and Video Retrieval, 2003.

[12] C. Snoek, M. Worring, and A. Smeulders. Early versus late fusion in semantic video analysis. In Proc. of the 13th ACM Int'l Conf. on Multimedia, 2005.

[13] M. Welling and G. E. Hinton. A new learning algo-rithm for mean field boltzmann machines. In ICANN’02: Proceedings of the Int’l Conf. on Artificial NeuralNetworks, pages 351–357, London, UK, 2002. Springer-Verlag.

[14] M. Welling, M. Rosen-Zvi, and G. Hinton. Exponen-tial family harmoniums with an application to infor-mation retrieval. In Advances in Neural InformationProcessing Systems 17, pages 1481–1488, Cambridge,MA, 2004. MIT Press.

[15] Y. Wu, E. Y. Chang, K. C.-C. Chang, and J. R.Smith. Optimal multimodal fusion for multimedia dataanalysis. In Proc. of the 12th annual ACM Int’l Conf.on Multimedia, pages 572–579, 2004.

[16] E. Xing, M. Jordan, and S. Russell. A generalizedmean field algorithm for variational inference in ex-ponential families. In Uncertainty in Artificial Intelli-gence (UAI2003). Morgan Kaufmann Publishers, 2003.

[17] E. Xing, R. Yan, and A. Hauptmann. Mining associated text and images with dual-wing harmoniums. In Proceedings of the 21st Annual Conf. on Uncertainty in Artificial Intelligence (UAI-05). AUAI Press, 2005.

[18] R. Yan, J. Yang, and A. G. Hauptmann. Learning query-class dependent weights in automatic video retrieval. In Proc. of the 12th ACM Int'l Conf. on Multimedia, pages 548–555. ACM Press, 2004.

APPENDIX

This appendix derives the harmonium random field (joint distribution) used in the family-of-harmonium model. We start from the general form of the exponential-family harmonium [14], with H as the latent topic variables and X and Z as two types of observed data variables. This harmonium random field has the exponential form:

$$
p(\mathbf{x}, \mathbf{z}, \mathbf{h}) \propto \exp\Big\{
\sum_{ia} \theta_{ia} f_{ia}(x_i) + \sum_{jb} \eta_{jb} g_{jb}(z_j) + \sum_{kc} \lambda_{kc} e_{kc}(h_k)
+ \sum_{ikac} W_{ia}^{kc} f_{ia}(x_i) e_{kc}(h_k) + \sum_{jkbc} U_{jb}^{kc} g_{jb}(z_j) e_{kc}(h_k)
\Big\}
$$

where $\{f_{ia}(\cdot)\}$, $\{g_{jb}(\cdot)\}$, and $\{e_{kc}(\cdot)\}$ denote the sufficient statistics (features) of the variables $x_i$, $z_j$, and $h_k$, respectively.

The marginal distribution, say $p(\mathbf{x}, \mathbf{z})$, is then obtained by integrating out the variables $\mathbf{h}$:

$$
\begin{aligned}
p(\mathbf{x}, \mathbf{z}) &= \int_{\mathbf{h}} p(\mathbf{x}, \mathbf{z}, \mathbf{h})\, d\mathbf{h}\\
&\propto \exp\Big\{\sum_{ia}\theta_{ia} f_{ia}(x_i) + \sum_{jb}\eta_{jb} g_{jb}(z_j)\Big\}
\prod_k \int_{h_k} \exp\Big\{\sum_c \Big(\lambda_{kc} + \sum_{ia} W_{ia}^{kc} f_{ia}(x_i) + \sum_{jb} U_{jb}^{kc} g_{jb}(z_j)\Big) e_{kc}(h_k)\Big\}\, dh_k\\
&= \exp\Big\{\sum_{ia}\theta_{ia} f_{ia}(x_i) + \sum_{jb}\eta_{jb} g_{jb}(z_j) + \sum_k C_k(\{\hat\lambda_{kc}\})\Big\}
\end{aligned}
$$

and similarly we can derive:

$$
\begin{aligned}
p(\mathbf{x}, \mathbf{h}) &\propto \exp\Big\{\sum_{ia}\theta_{ia} f_{ia}(x_i) + \sum_{kc}\lambda_{kc} e_{kc}(h_k) + \sum_j B_j(\{\hat\eta_{jb}\})\Big\}\\
p(\mathbf{z}, \mathbf{h}) &\propto \exp\Big\{\sum_{jb}\eta_{jb} g_{jb}(z_j) + \sum_{kc}\lambda_{kc} e_{kc}(h_k) + \sum_i A_i(\{\hat\theta_{ia}\})\Big\}
\end{aligned}
$$

where the shifted parameters $\hat\theta_{ia}$, $\hat\eta_{jb}$, and $\hat\lambda_{kc}$ are defined as:

$$
\begin{aligned}
\hat\theta_{ia} &= \theta_{ia} + \sum_{kc} W_{ia}^{kc} e_{kc}(h_k), \qquad
\hat\eta_{jb} = \eta_{jb} + \sum_{kc} U_{jb}^{kc} e_{kc}(h_k)\\
\hat\lambda_{kc} &= \lambda_{kc} + \sum_{ia} W_{ia}^{kc} f_{ia}(x_i) + \sum_{jb} U_{jb}^{kc} g_{jb}(z_j)
\end{aligned}
$$

The functions $A_i(\cdot)$, $B_j(\cdot)$, and $C_k(\cdot)$ are the log-partition functions:

$$
\begin{aligned}
A_i(\{\hat\theta_{ia}\}) &= \log \int_{x_i} \exp\Big\{\sum_a \hat\theta_{ia} f_{ia}(x_i)\Big\}\, dx_i\\
B_j(\{\hat\eta_{jb}\}) &= \log \int_{z_j} \exp\Big\{\sum_b \hat\eta_{jb} g_{jb}(z_j)\Big\}\, dz_j\\
C_k(\{\hat\lambda_{kc}\}) &= \log \int_{h_k} \exp\Big\{\sum_c \hat\lambda_{kc} e_{kc}(h_k)\Big\}\, dh_k
\end{aligned}
$$

Further integrating out variables from these distributions gives the marginal distributions of $\mathbf{x}$, $\mathbf{z}$, and $\mathbf{h}$:

$$
\begin{aligned}
p(\mathbf{x}) &\propto \exp\Big\{\sum_{ia}\theta_{ia} f_{ia}(x_i) + \sum_j B_j(\{\hat\eta_{jb}\}) + \sum_k C_k(\{\hat\lambda_{kc}\})\Big\}\\
p(\mathbf{z}) &\propto \exp\Big\{\sum_{jb}\eta_{jb} g_{jb}(z_j) + \sum_i A_i(\{\hat\theta_{ia}\}) + \sum_k C_k(\{\hat\lambda_{kc}\})\Big\}\\
p(\mathbf{h}) &\propto \exp\Big\{\sum_{kc}\lambda_{kc} e_{kc}(h_k) + \sum_i A_i(\{\hat\theta_{ia}\}) + \sum_j B_j(\{\hat\eta_{jb}\})\Big\}
\end{aligned}
$$

With all the above marginal distributions, we are ready to derive the conditional distributions:

$$
\begin{aligned}
p(\mathbf{x}|\mathbf{h}) &= \frac{p(\mathbf{x}, \mathbf{h})}{p(\mathbf{h})} \propto \prod_i \exp\Big\{\sum_a \hat\theta_{ia} f_{ia}(x_i) - A_i(\{\hat\theta_{ia}\})\Big\}\\
p(\mathbf{z}|\mathbf{h}) &= \frac{p(\mathbf{z}, \mathbf{h})}{p(\mathbf{h})} \propto \prod_j \exp\Big\{\sum_b \hat\eta_{jb} g_{jb}(z_j) - B_j(\{\hat\eta_{jb}\})\Big\}\\
p(\mathbf{h}|\mathbf{x}, \mathbf{z}) &= \frac{p(\mathbf{x}, \mathbf{z}, \mathbf{h})}{p(\mathbf{x}, \mathbf{z})} \propto \prod_k \exp\Big\{\sum_c \hat\lambda_{kc} e_{kc}(h_k) - C_k(\{\hat\lambda_{kc}\})\Big\}
\end{aligned}
$$

The specific conditional distributions of $\mathbf{x}$, $\mathbf{z}$, and $\mathbf{h}$ defined in Eqs.(3.2), (3.3), and (3.4) are all exponential-family distributions. They can be mapped to the general forms above with the following definitions:

$$
\begin{aligned}
&f_{i1}(x_i) = x_i, \qquad \theta_{i1} = \alpha_i, \qquad \hat\theta_{i1} = \alpha_i + \sum_k W_{ik} h_k\\
&g_{j1}(z_j) = z_j, \quad g_{j2}(z_j) = z_j^2, \qquad \eta_{j1} = \beta_j, \quad \eta_{j2} = -1/2, \qquad \hat\eta_{j1} = \beta_j + \sum_k U_{jk} h_k\\
&e_{k1}(h_k) = h_k, \quad e_{k2}(h_k) = h_k^2, \qquad \lambda_{k1} = 0, \quad \lambda_{k2} = -1/2, \qquad \hat\lambda_{k1} = \sum_i W_{ik} x_i + \sum_j U_{jk} z_j
\end{aligned}
$$

Therefore, plugging these definitions into the general form of the harmonium random field at the beginning of this appendix yields the specific random field:

$$
p(\mathbf{x}, \mathbf{z}, \mathbf{h}) \propto \exp\Big\{
\sum_i \alpha_i x_i + \sum_j \beta_j z_j - \sum_j \frac{z_j^2}{2} - \sum_k \frac{h_k^2}{2}
+ \sum_{ik} W_{ik} x_i h_k + \sum_{jk} U_{jk} z_j h_k
\Big\}
$$

which is exactly the same as Eq.(3.5), except that the latter is defined for a specific category.