
Disentangling Factors of Variation via Generative Entangling

Guillaume Desjardins [email protected], Aaron Courville [email protected], Yoshua Bengio [email protected]

Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal, QC H3C 3J7, Canada

Abstract

Here we propose a novel model family with the objective of learning to disentangle the factors of variation in data. Our approach is based on the spike-and-slab restricted Boltzmann machine, which we generalize to include higher-order interactions among multiple latent variables. Seen from a generative perspective, the multiplicative interactions emulate the entangling of factors of variation. Inference in the model can be seen as disentangling these generative factors. Unlike previous attempts at disentangling latent factors, the proposed model is trained using no supervised information regarding the latent factors. We apply our model to the task of facial expression classification.

1. Introduction

In many machine learning tasks, data originates from a generative process involving a complex interaction of multiple factors. Alone, each factor accounts for a source of variability in the data. Together, their interaction gives rise to the rich structure characteristic of many of the most challenging application domains. Consider, for example, the task of facial expression recognition. Two images of different individuals with the same facial expression may be well separated in pixel space. On the other hand, two images of the same individual showing different expressions may be positioned very close together in pixel space. In this simplified scenario, there are two factors at play: (1) the identity of the individual, and (2) the facial expression. One of these factors, the identity, is irrelevant to the task of facial expression recognition, and yet of the two factors it could well dominate the representation of the image in pixel space. As a result, pixel-space-based facial expression recognition systems seem likely to suffer poor performance due to the variation in appearance of individual faces.

Importantly, these interacting factors frequently do not combine as simple superpositions that can be easily separated by choosing an appropriate affine projection of the data. Rather, these factors often appear tightly entangled in the raw data. Our challenge is to construct representations of the data that cope with the reality of entangled factors of variation and provide features that may be appropriate to a wide variety of possible tasks. In the context of our face data example, a representation capable of disentangling identity and expression would be an effective representation for either facial recognition or facial expression classification.

In an effort to cope with these factors of variation, there has been a broad-based movement in machine learning and in application domains such as computer vision toward hand-engineering feature sets that are invariant to common sources of variation in the data. This is the motivation behind both the inclusion of feature pooling stages in the convolutional network architecture (LeCun et al., 1989) and the recent trend toward representations based on large-scale pooling of low-level features (Wang et al., 2009; Coates et al., 2011). These approaches all stem from the powerful idea that invariant features of the data can be induced through the pooling together of a set of simple filter responses. Potentially even more powerful is the notion that one can actually learn which filters to pool together from purely unsupervised data, and thereby extract directions of variance over which the pooling features become invariant (Kohonen et al., 1979; Kohonen, 1996; Hyvärinen and Hoyer, 2000; Le et al., 2010; Kavukcuoglu et al., 2009; Ranzato and Hinton, 2010; Courville et al., 2011b). However, in situations where multiple relevant but entangled factors of variation give rise to the data, we require a means of feature extraction that disentangles these factors rather than simply learning to represent some of them at the expense of those that

arX

iv:1

210.

5474

v1 [

stat

.ML

] 1

9 O

ct 2

012


are lost in the filter pooling operation.

Here we propose a novel model family with the objective of learning to disentangle the factors of variation evident in the data. Our approach is based on the spike-and-slab restricted Boltzmann machine (ssRBM) (Courville et al., 2011a), which has recently been shown to be a promising model of natural image data. We generalize the ssRBM to include higher-order interactions among multiple binary latent variables. Seen from a generative perspective, the multiplicative interactions of the binary latent variables emulate the entangling of the factors that give rise to the data. Conversely, inference in the model can be seen as an attempt to assign credit to the various interacting factors for their combined account of the data; in effect, to disentangle the generative factors. Our approach relies only on unsupervised approximate maximum likelihood learning of the model parameters, and as such we do not require the use of any label information in defining the factors to be disentangled. We believe this to be a research direction of critical importance, as it is almost never the case that label information exists for all factors responsible for variations in the data distribution.

2. Learning Invariant Features Versus Learning to Disentangle Features

The principle that invariant features can emerge, using only unsupervised learning, from the organization of features into subspaces was first established in the ASSOM model (Kohonen, 1996). Since then, the same basic strategy has reappeared in a number of different models and learning paradigms, including topographic independent component analysis (Hyvärinen and Hoyer, 2000; Le et al., 2010), invariant predictive sparse decomposition (IPSD) (Kavukcuoglu et al., 2009), as well as in Boltzmann machine-based approaches (Ranzato and Hinton, 2010; Courville et al., 2011b). In each case, the basic strategy is to group filters together by, for example, using a variable (the pooling feature) that gates the activation of all elements of the group. This gated activation mechanism causes the filters within the group to share a common window on the data, which in turn leads to filter groups composed of mutually complementary filters. In the end, the span of the filter vectors defines a subspace which specifies the directions in which the pooling feature is invariant. Somewhat surprisingly, this basic strategy has repeatedly demonstrated that useful invariant features can be learned in a strictly unsupervised fashion, using only the statistical structure inherent in the data. While remarkable, one important problem with this learning strategy is that the invariant representation formed by the pooling features offers a somewhat incomplete view of the data, as the detailed representation of the lower-level features is abstracted away in the pooling procedure. While we would like higher-level features to be more abstract and exhibit greater invariance, we have little control over what information is lost through feature subspace pooling.

Invariant features, by definition, have reduced sensitivity in the direction of invariance. This is the goal of building invariant features, and it is fully desirable if the directions of invariance all reflect sources of variance in the data that are uninformative to the task at hand. However, it is often the case that the goal of feature extraction is the disentangling or separation of many distinct but informative factors in the data. In this situation, the methods for generating invariant features, namely the feature subspace method, may be inadequate. Returning to our facial expression classification example from the introduction, consider a pooling feature made invariant to the expression of a subject by forming a subspace of low-level filters that represent the subject with various facial expressions (forming a basis for the subspace). If this is the only pooling feature associated with the appearance of this subject, then the facial expression information is lost to the model representation formed by the set of pooling features. As illustrated in our hypothetical facial expression classification task, this loss of information becomes a problem when the information that is lost is necessary to successfully complete the task at hand.

Obviously, what we would really like is for a particular feature set to be invariant to the irrelevant features and to disentangle the relevant features. Unfortunately, it is often difficult to determine a priori which set of features will ultimately be relevant to the task at hand. Further, as is often the case in the context of deep learning methods (Collobert and Weston, 2008), the feature set being trained may be destined to be used in multiple tasks that may have distinct subsets of relevant features. Considerations such as these lead us to the conclusion that the most robust approach to feature learning is to disentangle as many factors as possible, discarding as little information about the data as is practical. This is the motivation behind our proposed higher-order spike-and-slab Boltzmann machine.


E(v, s, f, g, h) = \frac{1}{2} v^T \Lambda v - \sum_k e_k f_k + \sum_{i,k} c_{ik} g_{ik} + \sum_{j,k} d_{jk} h_{jk} + \frac{1}{2} \sum_{i,j,k} \alpha_{ijk} s_{ijk}^2 + \sum_{i,j,k} \Big( -v^T W_{\cdot ijk}\, s_{ijk} - \alpha_{ijk} \mu_{ijk} s_{ijk} + \frac{1}{2} \alpha_{ijk} \mu_{ijk}^2 \Big)\, g_{ik} h_{jk} f_k

Figure 1. Energy function of our higher-order spike-and-slab RBM, used to disentangle (multiplicative) factors of variation in the data. Two groups of latent spike variables, g and h, interact to explain the data v through the weight tensor W. While the ssRBM instantiates a slab variable sj for each hidden unit hj, our higher-order model employs a slab sij for each pair of spike variables (gi, hj). μij and αij are respectively the mean and precision parameters of sij. An additional set of spike variables f is used to gate groups of latent variables g, h and serves to promote group sparsity. Most parameters are thus indexed by an extra subscript k. Finally, e, c and d are standard bias terms for the variables f, g and h, while Λ is a diagonal precision matrix on the visible vector.

3. Higher-order Spike-and-Slab Boltzmann Machines

In this section, we introduce a model which makes some progress toward the ambitious goal of disentangling factors of variation. The model is based on the Boltzmann machine, an undirected graphical model. In particular, we build on the spike-and-slab restricted Boltzmann machine (ssRBM) (Courville et al., 2011b), a model family that has previously shown promise as a means of learning invariant features via subspace pooling. The original ssRBM possessed a limited form of higher-order interaction between two latent random variables: the spike and the slab. Our extension adds higher-order interactions between four distinct latent random variables: one set of slab variables and three interacting binary spike variables. Unlike in the ssRBM, these interactions between latent variables violate the conditional independence structure of the restricted Boltzmann machine, so the model no longer belongs to that class. As a consequence, exact inference in the model is not tractable and we resort to a mean-field approximation.

Our strategy is to disentangle factors of variation via inference (recovering the posterior distribution over our latent variables) in a generative model. In the context of generative models, inference can roughly be thought of as running the generative process in reverse. Thus, if we wish our inference process to disentangle factors of variation, our generative process should describe a means of entangling factors. The generative model we propose here represents one possible means of factor entangling.

Let v ∈ R^D be the random visible vector that represents our observations, with its mean zeroed. We build a latent representation of this data with binary latent variables f ∈ {0, 1}^K, g ∈ {0, 1}^{M×K} and h ∈ {0, 1}^{N×K}. In the spike-and-slab context, we can think of f, g and h as a factored representation of the "spike" variables. We also include a set of real-valued "slab" variables s ∈ R^{M×N×K}, with element sijk associated with the hidden units fk, gik and hjk. The interaction between these variables is defined through the energy function of Fig. 1.

The parameters are defined as follows. W ∈ R^{D×M×N×K} is a weight 4-tensor connecting visible units to the interacting latent variables; its entries can be interpreted as forming a basis in image space. μ ∈ R^{M×N×K} and α ∈ R^{M×N×K} are tensors describing the mean and precision of each sijk; Λ ∈ R^{D×D} is a diagonal precision matrix on the visible vector; and finally c ∈ R^{M×K}, d ∈ R^{N×K} and e ∈ R^K are biases on the matrices g, h and the vector f respectively. The energy function fully specifies the joint probability distribution over the variables v, s, f, g and h:

p(v, s, f, g, h) = \frac{1}{Z} \exp\{ -E(v, s, f, g, h) \},

where Z is the partition function, which ensures that the joint distribution is normalized.
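To make the index bookkeeping of Fig. 1 concrete, the following NumPy sketch evaluates the energy for one joint configuration of (v, s, f, g, h). The array names and the way shapes are laid out are illustrative choices of ours, not code from the paper.

```python
import numpy as np

def energy(v, s, f, g, h, W, Lambda, mu, alpha, c, d, e):
    """Energy of the higher-order ssRBM (Fig. 1).

    Illustrative shapes: v (D,), s/mu/alpha (M, N, K), f (K,), g (M, K),
    h (N, K), W (D, M, N, K), Lambda (D, D) diagonal, c (M, K), d (N, K), e (K,).
    """
    # Quadratic term on the visibles and the linear bias terms.
    E = 0.5 * v @ Lambda @ v
    E -= e @ f
    E += np.sum(c * g) + np.sum(d * h)
    # Quadratic penalty on the slab variables.
    E += 0.5 * np.sum(alpha * s ** 2)
    # Higher-order term: each (i, j, k) triple is gated by g_ik * h_jk * f_k.
    vW = np.einsum('d,dmnk->mnk', v, W)                 # v^T W_{., i, j, k}
    gate = g[:, None, :] * h[None, :, :] * f[None, None, :]
    E += np.sum((-vW * s - alpha * mu * s + 0.5 * alpha * mu ** 2) * gate)
    return E
```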

As specified above, the energy function is similar to the ssRBM energy function (Courville et al., 2011b;a), but includes a factored representation of the standard ssRBM spike variable. Yet, clearly the properties of the model are highly dependent on the topology of the interactions between the real-valued slab variables sijk and the three binary spike variables fk, gik and hjk. We adopt a strategy that permits local interactions within small groups of f, g and h in a block-like organizational pattern, as specified in Fig. 2. The local block structure allows the model to work incrementally towards disentangling the features by focusing on manageable subparts of the problem.

Figure 2. Block-sparse connectivity pattern with dense interactions between g and h within each block (only shown for the k-th block). Each block is gated by a separate fk variable.

Similar to the standard spike-and-slab restricted Boltzmann machine (Courville et al., 2011b;a), the energy function in Eq. 1 gives rise to a Gaussian conditional over the visible variables:

p(v \mid s, f, g, h) = N\Big( \Lambda^{-1} \sum_{i,j,k} W_{\cdot ijk}\, s_{ijk}\, g_{ik} h_{jk} f_k,\; \Lambda^{-1} \Big)

Here we have a four-way multiplicative interaction between the latent variables s, f, g and h. The real-valued slab variable sijk acts to scale the contribution of the weight vector W·ijk. As a consequence, after marginalizing out s, the factors f, g and h can be seen as contributing both to the conditional mean and to the conditional variance of p(v | f, g, h):

p(v \mid f, g, h) = N\Big( \Lambda^{-1} \sum_{i,j,k} W_{\cdot ijk}\, \mu_{ijk}\, g_{ik} h_{jk} f_k,\; C_{v \mid f,g,h} \Big)

C_{v \mid f,g,h} = \Big( \Lambda - \sum_{i,j,k} W_{\cdot ijk} W_{\cdot ijk}^T\, \alpha_{ijk}^{-1}\, g_{ik} h_{jk} f_k \Big)^{-1}
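For reference, the mean and covariance above can be assembled directly from the parameters, as in the following sketch (same illustrative shapes as in the earlier energy sketch; it assumes the precisions keep the covariance positive definite, as the bounded-variance constraints of Sec. 5.2 are meant to guarantee).

```python
import numpy as np

def conditional_mean_cov(f, g, h, W, Lambda, mu, alpha):
    """Mean and covariance of p(v | f, g, h) after marginalizing out s."""
    gate = g[:, None, :] * h[None, :, :] * f[None, None, :]       # (M, N, K)
    # Covariance: (Lambda - sum_ijk W W^T alpha^{-1} * gate)^{-1}
    C = np.linalg.inv(Lambda - np.einsum('dmnk,emnk,mnk->de',
                                         W, W, gate / alpha))
    # Mean: Lambda^{-1} sum_ijk W mu * gate
    mean = np.linalg.solve(Lambda, np.einsum('dmnk,mnk->d', W, mu * gate))
    return mean, C
```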

This is an important property of the spike-and-slab framework that is also shared by other latent variable models of real-valued data, such as the mean-covariance restricted Boltzmann machine (mcRBM) (Ranzato and Hinton, 2010) and the mean Product of T-distributions model (mPoT) (Ranzato et al., 2010).

From a generative perspective, the model can be thought of as consisting of a set of K factor blocks whose activity is gated by the f variables. Within each block, the variables g·k and h·k can be thought of as local latent factors whose interaction gives rise to the active block's contribution to the visible vector. Crucially, the multiplicative interaction between the g·k and h·k for a given block k is mediated by the weight tensor W·,·,·,k and the corresponding slab variables s·,·,k. Contrary to more standard probabilistic factor models, whose factors simply sum to give rise to the visible vector, the individual contributions of the elements of g·k and h·k are not easily isolated from one another. We can think of the generative process as entangling the local block factor activations.

From an encoding perspective, we are interested in using the posterior distribution over the latent variables as a representation or encoding of the data. Unlike in RBMs, in the proposed model, where we have higher-order interactions over the latent variables, the posterior over the latent variables does not factorize cleanly. By marginalizing over the slab variables s, we can recover a set of conditionals describing how the binary latent variables f, g and h interact. The conditional P(f | v, g, h) is given below.

P(f_k = 1 \mid v, g, h) = \mathrm{sigm}\Big( e_k + \sum_{i,j} v^T W_{\cdot ijk}\, \mu_{ijk}\, g_{ik} h_{jk} + \frac{1}{2} \sum_{i,j} \alpha_{ijk}^{-1} \big[ v^T W_{\cdot ijk} \big]^2 g_{ik} h_{jk} \Big)

This illustrates that, with the factor configuration given in Fig. 2, the factors fk are activated (take the value 1) through the sum-pooled response of all the weight vectors W·ijk (for 1 ≤ i ≤ M and 1 ≤ j ≤ N), differentially gated by the values of gik and hjk, whose conditionals are respectively given by:

P(g_{ik} = 1 \mid v, f, h) = \mathrm{sigm}\Big( c_{ik} + \sum_{j=1}^{N} v^T W_{\cdot ijk}\, \mu_{ijk}\, h_{jk} f_k + \frac{1}{2} \sum_{j=1}^{N} \alpha_{ijk}^{-1} \big[ v^T W_{\cdot ijk} \big]^2 h_{jk} f_k \Big)

P(h_{jk} = 1 \mid v, f, g) = \mathrm{sigm}\Big( d_{jk} + \sum_{i=1}^{M} v^T W_{\cdot ijk}\, \mu_{ijk}\, g_{ik} f_k + \frac{1}{2} \sum_{i=1}^{M} \alpha_{ijk}^{-1} \big[ v^T W_{\cdot ijk} \big]^2 g_{ik} f_k \Big)

For completeness, we also include the Gaussian conditional distribution over the slab variables s:

p(s_{ijk} \mid v, f, g, h) = N\Big( \big[ \alpha_{ijk}^{-1} v^T W_{\cdot ijk} + \mu_{ijk} \big] f_k g_{ik} h_{jk},\; \alpha_{ijk}^{-1} \Big)

From an encoding perspective, the gating pattern on the g and h variables, evident from Fig. 2 and from


the conditional distributions, defines a form of local bilinear interaction (Tenenbaum and Freeman, 2000). We can interpret the values of gik and hjk within block k as acting as basis indicators, in dimensions i and j, for the linear subspace in the visible space defined by W·ijk sijk.

From this perspective, we can think of [g·k, h·k] as defining a block-local binary coordinate encoding of the data. Consider the case illustrated by Fig. 2, where we have M = 5, N = 5 and the number of blocks (K) is 4. For each block, we have M × N = 25 filters which we encode using M + N = 10 binary latent variables, where each gik (alternately hjk) effectively pools over the subspace characterized by the variables hjk, 1 ≤ j ≤ N (alternately gik, 1 ≤ i ≤ M) through their relative interaction with W·ijk sijk. As a concrete example, imagine that the structure of the weight tensor was such that, along the dimension indexed by i, the weight vectors W·ijk form Gabor-like edge detectors of different orientations, while along the dimension indexed by j, the weight vectors W·ijk form Gabor-like edge detectors of different colors. In this hypothetical example, gik would encode orientation information while being invariant to the color of the edge, and hjk would encode color information while being invariant to orientation. Hence we could say that we have disentangled the latent factors.

3.1. Higher-order Interactions as a Multi-Way Pooling Strategy

As alluded to above, one interpretation of the role of g and h is as distinct and complementary sum-pooled feature sets. Returning to Fig. 2, we can see that, for each block, the gik pool across the columns of the k-th block, along the i-th row, while the h·k pool across rows, along the j-th column. The f variables are also interpretable as pooling across all elements of the block. One way to interpret the complementary pooling structures of g and h is as a multi-way pooling strategy.
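The three complementary poolings can be illustrated on a hypothetical M × N matrix of per-filter responses for a single block; the response matrix below is a stand-in we introduce only for illustration.

```python
import numpy as np

# Hypothetical per-filter responses r[i, j] ~ v^T W_{., i, j, k} for one block k.
M, N = 5, 5
rng = np.random.default_rng(0)
r = rng.normal(size=(M, N))

g_pool = r.sum(axis=1)   # g_ik pools across columns (along row i)
h_pool = r.sum(axis=0)   # h_jk pools across rows (along column j)
f_pool = r.sum()         # f_k pools over the entire block
print(g_pool.shape, h_pool.shape, f_pool)
```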

This particular pooling structure was chosen to study the potential of learning the kind of bilinear interaction that exists between the g·k and h·k within a block. The fk are present to promote block cohesion by gating the interaction between the g·k and h·k and the visible vector v.

This higher-order structure is of course just one choice among many possible higher-order interaction architectures. One can easily imagine defining arbitrary overlapping pooling regions, with the number of overlapping pooling regions specifying the order of the latent variable interaction. We believe that exploration of overlapping pooling regions of this type is a promising direction for future inquiry. One potentially interesting direction is to consider overlapping blocks (such as our f blocks). The overlap would define a topology over the features, as they would share lower-level features (i.e. the slab variables). A topology thus defined could potentially be exploited to build higher-level data representations that possess local receptive fields. These kinds of local receptive fields have been shown to be useful in building large and deep models that perform well on object classification tasks in natural images (Coates et al., 2011).

3.2. Variational Inference and Unsupervised Learning

Due to the multiplicative interaction between the latent variables f, g and h, computation of P(f | v), P(g | v) and P(h | v) is intractable. While the slab variables also interact multiplicatively, we are able to analytically marginalize over them. Consequently, we resort to a variational approximation of the joint conditional P(f, g, h | v) with the standard mean-field structure, i.e. we choose Q_v(f, g, h) = Q_v(f) Q_v(g) Q_v(h) such that the KL divergence KL(Q_v(f, g, h) ‖ P(f, g, h | v)) is minimized, or equivalently, such that the variational lower bound L(Q_v) on the log likelihood of the data is maximized:

\max_{Q_v} L(Q_v) = \max_{Q_v} \sum_{f,g,h} Q_v(f)\, Q_v(g)\, Q_v(h) \log\!\left( \frac{p(f, g, h \mid v)}{Q_v(f)\, Q_v(g)\, Q_v(h)} \right),

where the sums are taken over all values of the elements of f, g and h respectively. Maximizing this lower bound with respect to the variational parameters \hat{f}_k ≡ Q_v(f_k = 1), \hat{g}_{ik} ≡ Q_v(g_{ik} = 1) and \hat{h}_{jk} ≡ Q_v(h_{jk} = 1) results in the set of approximating factored distributions:

\hat{f}_k = \mathrm{sigm}\Big( e_k + \sum_{i,j} v^T W_{\cdot ijk}\, \mu_{ijk}\, \hat{g}_{ik} \hat{h}_{jk} + \frac{1}{2} \sum_{i,j} \alpha_{ijk}^{-1} \big[ v^T W_{\cdot ijk} \big]^2 \hat{g}_{ik} \hat{h}_{jk} \Big),

\hat{g}_{ik} = \mathrm{sigm}\Big( c_{ik} + \sum_{j=1}^{N} v^T W_{\cdot ijk}\, \mu_{ijk}\, \hat{h}_{jk} \hat{f}_k + \frac{1}{2} \sum_{j=1}^{N} \alpha_{ijk}^{-1} \big[ v^T W_{\cdot ijk} \big]^2 \hat{h}_{jk} \hat{f}_k \Big),


\hat{h}_{jk} = \mathrm{sigm}\Big( d_{jk} + \sum_{i=1}^{M} v^T W_{\cdot ijk}\, \mu_{ijk}\, \hat{g}_{ik} \hat{f}_k + \frac{1}{2} \sum_{i=1}^{M} \alpha_{ijk}^{-1} \big[ v^T W_{\cdot ijk} \big]^2 \hat{g}_{ik} \hat{f}_k \Big).

The above equations form a set of fixed-point equations which we iterate until the values of all Q_v(fk), Q_v(gik) and Q_v(hjk) converge. Since the expression for \hat{f}_k does not depend on \hat{f}_{k'}, ∀k', \hat{g}_{ik} does not depend on \hat{g}_{i'k'}, ∀i', k', and \hat{h}_{jk} does not depend on \hat{h}_{j'k'}, ∀j', k', we can define a three-stage update strategy where we update all K values of \hat{f} in parallel, then all K × M values of \hat{g} in parallel, and finally all K × N values of \hat{h} in parallel.
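To make the update schedule concrete, here is a minimal NumPy sketch of the fixed-point iteration, using the same illustrative array shapes as in the earlier energy sketch. The variable names and the stopping rule (a fixed number of sweeps instead of a convergence test) are our own choices, not the authors' code.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W, mu, alpha, c, d, e, n_iter=20):
    """Three-stage mean-field updates for Q_v(f), Q_v(g), Q_v(h) (Sec. 3.2)."""
    vW = np.einsum('d,dmnk->mnk', v, W)          # v^T W_{., i, j, k}
    lin = vW * mu                                # linear term in the sigmoid argument
    quad = 0.5 * (vW ** 2) / alpha               # quadratic term
    M, N, K = mu.shape
    f_hat = np.full(K, 0.5)
    g_hat = np.full((M, K), 0.5)
    h_hat = np.full((N, K), 0.5)
    for _ in range(n_iter):
        gh = g_hat[:, None, :] * h_hat[None, :, :]
        f_hat = sigm(e + np.sum((lin + quad) * gh, axis=(0, 1)))
        hf = h_hat[None, :, :] * f_hat[None, None, :]
        g_hat = sigm(c + np.sum((lin + quad) * hf, axis=1))
        gf = g_hat[:, None, :] * f_hat[None, None, :]
        h_hat = sigm(d + np.sum((lin + quad) * gf, axis=0))
    return f_hat, g_hat, h_hat
```

Each stage updates one whole set of variational parameters in parallel, exactly as the three-stage strategy above prescribes.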

Following the variational EM training approach (Saul et al., 1996), we alternately maximize the lower bound L(Q_v) with respect to the variational parameters \hat{f}, \hat{g} and \hat{h} (E-step) and maximize L(Q_v) with respect to the model parameters (M-step). The gradient of L(Q_v) with respect to the model parameters θ = {W, μ, α, Λ, b, c, d, e} is given by:

\frac{\partial L(Q_v)}{\partial \theta} = \sum_{f,g,h} Q_v(f)\, Q_v(g)\, Q_v(h)\; E_{p(s \mid v, f, g, h)}\!\left[ -\frac{\partial E}{\partial \theta} \right] + E_{p(v, s, f, g, h)}\!\left[ \frac{\partial E}{\partial \theta} \right],

where E is the energy function given in Eq. 1. As is evident from this expression, the gradient of L(Q_v) with respect to the model parameters contains two terms: a positive phase that depends on the data v, and a negative phase, derived from the partition function of the joint p(v, s, f, g, h), that does not. We adopt a training strategy similar to that of Salakhutdinov and Hinton (2009), in that we combine a variational approximation of the positive phase of the gradient with a block Gibbs sampling-based stochastic approximation of the negative phase. Our Gibbs sampler alternately samples, in parallel, each set of random variables, sampling from p(f | v, g, h), p(g | v, f, h), p(h | v, f, g), p(s | v, f, g, h), and finally from p(v | f, g, h, s).
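One sweep of this negative-phase sampler can be sketched as follows. This is our own illustrative reading of the conditionals above (shapes and names as in the earlier sketches), not the authors' implementation, and it assumes α > 0 and a positive-definite diagonal Λ.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, f, g, h, W, Lambda, mu, alpha, c, d, e, rng):
    """One block Gibbs sweep: sample f, g, h, s and finally v, each set in parallel."""
    vW = np.einsum('d,dmnk->mnk', v, W)
    lin, quad = vW * mu, 0.5 * (vW ** 2) / alpha

    gh = g[:, None, :] * h[None, :, :]
    f = (rng.random(f.shape) < sigm(e + np.sum((lin + quad) * gh, axis=(0, 1)))).astype(float)
    hf = h[None, :, :] * f[None, None, :]
    g = (rng.random(g.shape) < sigm(c + np.sum((lin + quad) * hf, axis=1))).astype(float)
    gf = g[:, None, :] * f[None, None, :]
    h = (rng.random(h.shape) < sigm(d + np.sum((lin + quad) * gf, axis=0))).astype(float)

    # s | v, f, g, h is Gaussian with mean [alpha^{-1} v^T W + mu] f g h, precision alpha.
    gate = g[:, None, :] * h[None, :, :] * f[None, None, :]
    s = (vW / alpha + mu) * gate + rng.normal(size=mu.shape) / np.sqrt(alpha)

    # v | s, f, g, h is Gaussian with mean Lambda^{-1} sum_ijk W s g h f, covariance Lambda^{-1}.
    v_mean = np.linalg.solve(Lambda, np.einsum('dmnk,mnk->d', W, s * gate))
    v = v_mean + rng.normal(size=v.shape) / np.sqrt(np.diag(Lambda))
    return v, f, g, h, s
```

In a stochastic-approximation (persistent chain) setting, the positive-phase statistics would come from the mean-field expectations of the previous subsection, and the negative-phase statistics from samples produced by repeated calls to such a sweep.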

3.3. The Challenge of Unsupervised Learning to Disentangle

Above, we briefly outlined our unsupervised training procedure. The web of interactions between the latent random variables, particularly those between g and h, makes unsupervised learning of the model parameters a particularly challenging problem. It is this difficulty of learning that motivates our block-wise organization of the interactions between the g and h variables. The block structure allows the interactions between g and h to remain local, with each g interacting with relatively few h and each h interacting with relatively few g. This local neighborhood structure allows the inference and learning procedures to better manage the complexities of teasing apart the latent variable interactions and adapting the model parameters to (approximately) maximize the likelihood.

By using many of these blocks of local interactions, we can leverage the known tractable learning properties of models such as the RBM. Specifically, if we consider each block as a kind of super hidden unit gated by f, then with no interactions across blocks (apart from those mediated by the mutual connections to the visible units) the model assumes the form of an RBM.

While our chosen interaction structure allows our higher-order model to learn, one consequence is that the model is only capable of disentangling relatively local factors that appear within a single block. We suggest that one promising avenue toward more extensive disentangling is to stack multiple versions of the proposed model and perform layer-by-layer disentangling of the factors of variation present in the data. The idea is to start with local disentangling and move gradually toward disentangling non-local and more abstract factors.

4. Related Work

The model proposed here was strongly influenced by previous attempts to disentangle factors of variation in data using latent variable models. Earlier efforts in this direction also used higher-order interactions of latent variables, specifically bilinear (Tenenbaum and Freeman, 2000; Grimes and Rao, 2005) and multilinear (Vasilescu and Terzopoulos, 2005) models. One critical difference is that, unlike these previous methods, we attempt to learn to disentangle from entirely unsupervised information. In this way, one can interpret our approach as an attempt to extend the subspace feature pooling approach to the problem of disentangling factors of variation.

Bilinear models are essentially linear models whose higher-level state is factored into the product of two variables. Formally, the elements of the observation x are given by

x_k = \sum_i \sum_j W_{ijk}\, y_i z_j, \quad \forall k,

where y_i and z_j are elements of the two factors (y and z) representing the observation and W_{ijk} is an element of the tensor of model parameters (Tenenbaum and Freeman, 2000).
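Concretely, the bilinear map is a single tensor contraction; the toy sizes below are our own, chosen only to show the indexing.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 3, 6                      # sizes of y, z and the observation x
W = rng.normal(size=(I, J, K))         # tensor of model parameters
y = rng.normal(size=I)                 # one factor, e.g. "style"
z = rng.normal(size=J)                 # the other factor, e.g. "content"

x = np.einsum('ijk,i,j->k', W, y, z)   # x_k = sum_ij W_ijk y_i z_j
```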


The tensor W can be thought of as a generalization of the typical weight matrix found in most of the unsupervised models we have considered above. Tenenbaum and Freeman (2000) developed an EM-based algorithm to learn the model parameters and demonstrated, using images of letters from a set of distinct fonts, that the model could disentangle style (font characteristics) from content (letter identity). Grimes and Rao (2005) later developed a bilinear sparse coding model of a similar form, but added terms to the objective function to render the elements of both y and z sparse. They also required observation of the factors in order to train the model, and used the model to develop transformation-invariant features of natural images. Multilinear models are a generalization of the bilinear model in which two or more factors can be composed together. Vasilescu and Terzopoulos (2005) develop a multilinear ICA model, which they use to model images of faces, disentangling factors of variation such as illumination, view (orientation of the image plane relative to the face) and the identities of the people.

Hinton et al. (2011) also propose to disentangle factors of variation by learning to extract features associated with pose parameters, where the changes in pose parameters (but not the feature values) are known at training time. The proposed model is also closely related to recent work (Memisevic and Hinton, 2010) in which higher-order Boltzmann machines are used as models of spatial transformations in images. While there are a number of differences between this model and ours, the most significant is our use of multiplicative interactions between latent variables. While they included higher-order interactions within the Boltzmann energy function, these were used exclusively between observed variables, dramatically simplifying the inference and learning procedures. Another major point of departure is that instead of relying on low-rank approximations to the weight tensor, our approach employs highly structured and sparse connections between latent variables (e.g. gik does not interact with hjk' for k' ≠ k), reminiscent of recent work on structured sparse coding (Gregor et al., 2011) and structured l1-norms (Bach et al., 2011). As discussed above, our use of a sparse connection structure allows us to isolate groups of interacting latent variables. Keeping the interactions local in this way is a key component of our ability to successfully learn using only unsupervised data.

Figure 3. (top) Samples from our synthetic dataset (before noise). In each image, a figure "X" can appear at five different positions, in one of eight basic colors. Objects in a given image must all be of the same color. (bottom) Filters learnt by a bilinear ssRBM with M = 3, N = 5 (rows indexed by g1–g3, columns by h1–h5), which successfully disentangle color information (rows) from position (columns).

5. Experiments

5.1. Toy Experiment

We showcase the ability of our model to disentangle factors of variation by training it on a synthetic dataset, a subset of which is shown in Fig. 3 (top). Each color image, of size 3 × 20, is composed of one basic object of varying color, which can appear at five different positions. The constraint is that all objects in a given image must be of the same color. Additive Gaussian noise is superimposed on the resulting images to facilitate mixing of the RBM negative phase. A bilinear ssRBM with M = 3 and N = 5 should in theory have the capacity to disentangle the two factors of variation present in the data, as there are 2^3 possible colors and 2^5 configurations of object placement. The resulting filters are shown in Fig. 3 (bottom): the model has successfully learnt a binary encoding of color along the g-units (rows) and of position along the h-units (columns). Note that this would have been extremely difficult to achieve without multiplicative interactions of latent variables: an RBM with 15 hidden units technically has the capacity to learn similar filters, but it would be incapable of enforcing mutual exclusivity between hidden units of different colors. The bilinear ssRBM, on the other hand, generates near-perfect samples (not shown), while factoring the representation for use in deeper layers.
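For concreteness, data of this kind can be generated along the following lines; the exact object shape, its width and the noise level are our guesses rather than the paper's specification.

```python
import numpy as np

def make_toy_image(positions, color, noise_std=0.1, rng=None):
    """Synthetic 3 x 20 image: objects at the given positions, all of one color.

    `positions` is a length-5 binary vector (2^5 placements), `color` a
    length-3 binary RGB vector (2^3 basic colors). Layout and noise are guesses.
    """
    rng = np.random.default_rng() if rng is None else rng
    img = np.zeros((3, 20))
    for p, on in enumerate(positions):
        if on:
            img[:, 4 * p:4 * p + 4] = np.asarray(color)[:, None]
    return img + rng.normal(scale=noise_std, size=img.shape)

# Example: objects at positions 0 and 3, in "magenta" (R and B channels on).
sample = make_toy_image([1, 0, 0, 1, 0], [1, 0, 1])
```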

5.2. Toronto Face Dataset

We evaluate our model on the recently introduced Toronto Face Dataset (TFD) (Susskind et al., 2010), which contains a large number of black-and-white 48×48 preprocessed facial images. These span a wide range of identities and emotions and, as such, the dataset


is well suited to study the problem of disentangling: models which can successfully separate identity from emotion should perform well at the supervised learning task, which involves classifying images into one of seven categories: {anger, disgust, fear, happy, sad, surprise, neutral}. The dataset is divided into two parts: a large unlabeled set (meant for unsupervised feature learning) and a smaller labeled set. Note that emotions appear much more prominently in the latter, since these are acted out and thus prone to exaggeration. In contrast, most of the unlabeled set contains natural expressions over a wider range of individuals.

In the course of this work, we have made several key refinements to the original spike-and-slab formulation. Notably, since the slab variables {sijk; ∀j} can be interpreted as coordinates in the subspace of the spike variable gik (which spans the set of filters {W·,ijk, ∀j}), it is natural for these filters to be unit-norm. Each maximum likelihood gradient update is thus followed by a projection of the filters onto the unit-norm ball. Similarly, there exists an over-parametrization in the direction of W·,ijk and the sign of μijk, the parameter controlling the mean of sijk. We thus constrain μijk to be positive, in our case greater than 1. Similar constraints are applied on B and α to ensure that the variances of the visible and slab variables remain bounded. While previous work (Courville et al., 2011a) used the expected value of the spike variables as the input to classifiers, or to higher layers in deep networks, we found that the above re-parametrization consistently led to better results when using the product of the expectations of h and s. For pooled models, we simply take the product of each binary spike with the norm of its associated slab vector.
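The re-parametrization amounts to a projection step applied after each gradient update; the sketch below is our own reading of it, with illustrative bounds (the paper constrains μ ≥ 1 but does not give the other bounds here), and it treats the visible precision as a vector of diagonal entries.

```python
import numpy as np

def project_parameters(W, mu, alpha, lam, mu_min=1.0,
                       alpha_bounds=(1e-2, 1e2), lam_bounds=(1e-2, 1e2)):
    """Constraint projection applied after each gradient update (Sec. 5.2).

    Filters W_{., i, j, k} are rescaled to unit norm, mu is kept >= 1, and the
    precisions are clipped so slab and visible variances stay bounded.
    The specific clipping bounds are illustrative, not from the paper.
    """
    norms = np.sqrt(np.einsum('dmnk,dmnk->mnk', W, W)) + 1e-12
    W = W / norms[None, :, :, :]           # project filters onto the unit-norm ball
    mu = np.maximum(mu, mu_min)            # keep slab means positive (>= 1)
    alpha = np.clip(alpha, *alpha_bounds)  # bounded slab precision
    lam = np.clip(lam, *lam_bounds)        # bounded visible precision (diagonal)
    return W, mu, alpha, lam
```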

Disentangling Emotion from Identity. We begin with a qualitative evaluation of our model, by visualizing the learned filters (inner-most dimension of the tensor W) and pooling structures. We trained a model with K = 100 and M = N = 5 (that is to say, 100 blocks of 5 × 5 interacting g and h units) on a weighted combination of the labeled and unlabeled training sets. Doing so (as opposed to training on the unlabeled set only) allows for greater interpretability of the results, as emotion is a more prominent factor of variation in the labeled set. The results, shown in Figure 4, clearly show global cohesion within the blocks pooled by fk, with row and column structure correlating with variations in appearance/identity and emotion.

Figure 4. Example blocks obtained with K = 100, M = N = 5. The filters (inner-most dimension of tensor W) in each block exhibit global cohesion, specializing to a subset of identities and emotions: {happiness, fear, neutral} (left) and {happiness, anger} (right). In both cases, g-units (which pool over columns) encode emotions, while h-units (which pool over rows) are more closely tied to identity.

Disentangling via Unsupervised Feature Learning. We now evaluate the representation learnt by our disentangling RBM by measuring its usefulness for the task of emotion recognition. Our main objective here is to evaluate the usefulness of disentangling over traditional pooling approaches, as well as over larger, unpooled models. We thus consider ssRBMs with 3000 and 5000 features, with either (i) no pooling (i.e. K = 5000 spikes with N = 1 slabs per spike), (ii) pooling along a single dimension (i.e. K = 1000 spike variables, pooling N = 5 slabs), or (iii) disentangled through our higher-order ssRBM (i.e. K = 200, with g and h units arranged in an M × N grid, with M = N = 5).

We followed the standard TFD training protocol of performing unsupervised training on the unlabeled set, and then using the learnt representation as input to a linear SVM, trained and cross-validated on the labeled set. Table 1 shows the accuracy obtained by the various spike-and-slab models, averaged over the 5 folds.

We report two sets of numbers for models with pooling or disentangling: one where we use the "factored representation", which is the element-wise product of the spike variables with the norm of their associated slab vector, and one where we use the "unfactored representation", the higher-dimensional representation formed by considering all slab variables, each multiplied by their associated spike.
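In code, the two representations reduce to a norm and an outer product; the sketch below covers the first-order pooled case for simplicity, with names of our own choosing, applied to the expected spikes and slabs.

```python
import numpy as np

def factored_unfactored(spike_exp, slab_exp):
    """Feature vectors fed to the SVM (first-order pooled case).

    spike_exp: (K,) expected spike values; slab_exp: (K, N) expected slabs.
    Factored   -> one feature per spike: spike * ||slab vector||   (K dims)
    Unfactored -> every slab scaled by its spike                   (K*N dims)
    """
    factored = spike_exp * np.linalg.norm(slab_exp, axis=1)
    unfactored = (spike_exp[:, None] * slab_exp).ravel()
    return factored, unfactored
```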

We can see that the higher-order ssRBM achieves the best result, 77.4%, using the factored representation. The fact that our model outperforms the "unfactored" one confirms our disentangling hypothesis: our model has successfully learnt a lower-dimensional (factored) representation of the data that is useful for classification. For reference, a linear SVM classifier on the pixels achieves 71.5% (Susskind et al., 2010), an


Model     K     M    N    Factored (valid / test)   Unfactored (valid / test)
ssRBM     3000  n/a  1    n/a / n/a                 76.0% / 75.7%
ssRBM     999   n/a  3    72.9% / 74.4%             74.9% / 73.5%
hossRBM   330   3    3    76.0% / 75.7%             75.3% / 75.2%
hossRBM   120   5    5    71.4% / 70.7%             74.5% / 74.2%
ssRBM     5000  n/a  1    n/a / n/a                 76.7% / 76.3%
ssRBM     1000  n/a  5    74.2% / 74.0%             75.9% / 74.6%
hossRBM   555   3    3    77.6% / 77.4%             76.2% / 75.9%
hossRBM   200   5    5    73.3% / 73.3%             75.6% / 75.3%

Table 1. Classification accuracy on the Toronto Face Dataset. We compare our higher-order ssRBM (hossRBM) for various numbers of blocks K and pooling regions M × N against first-order ssRBMs, which pool along a single dimension of size N. The first four models contain approximately 3,000 filters, the bottom four approximately 5,000. In both cases, we compare the effect of using the factored representation versus the unfactored representation.

MLP trained with supervised backprop achieves 72.72% (Salah Rifai, personal communication), and a deep mPoT model (Ranzato et al., 2011), which exploits local receptive fields, achieves 82.4%.

6. Conclusion

We have presented a higher-order extension of the spike-and-slab restricted Boltzmann machine that factors the standard binary spike variable into three interacting factors. From a generative perspective, these interactions act to entangle the factors represented by the latent binary variables. Inference is interpreted as a process of disentangling the factors of variation in the data. As previously mentioned, we believe an important direction of future research is the exploration of methods to gradually disentangle the factors of variation by stacking multiple instantiations of the proposed model into a deep architecture.

References

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Structured sparsity through convex optimization. CoRR, abs/1109.2397, 2011.

A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), 2011.

R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 160–167. ACM, 2008.

A. Courville, J. Bergstra, and Y. Bengio. Unsupervised models of images by spike-and-slab RBMs. In ICML'2011, 2011a.

A. Courville, J. Bergstra, and Y. Bengio. A Spike and Slab Restricted Boltzmann Machine. In AISTATS'2011, 2011b.

K. Gregor, A. Szlam, and Y. LeCun. Structured sparse coding via lateral inhibition. In Advances in Neural Information Processing Systems (NIPS 2011), volume 24, 2011.

D. B. Grimes and R. P. Rao. Bilinear sparse coding for invariant vision. Neural Computation, 17(1):47–73, January 2005.

G. Hinton, A. Krizhevsky, and S. Wang. Transforming auto-encoders. In ICANN'2011: International Conference on Artificial Neural Networks, 2011.

A. Hyvärinen and P. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.

K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR'09), pages 1605–1612. IEEE, 2009.

T. Kohonen. Emergence of invariant-feature detectors in the adaptive-subspace self-organizing map. Biological Cybernetics, 75:281–291, 1996. ISSN 0340-1200.

T. Kohonen, G. Nemeth, K.-J. Bry, M. Jalanko, and H. Riittinen. Spectral classification of phonemes by learning subspaces. In ICASSP '79, volume 4, pages 97–100, 1979.

Q. Le, J. Ngiam, Z. Chen, D. J. hao Chia, P. W. Koh, and A. Ng. Tiled convolutional neural networks. In NIPS'2010, 2010.

Y. LeCun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson, R. E. Howard, and W. Hubbard. Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine, 27(11):41–46, Nov. 1989.

R. Memisevic and G. E. Hinton. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492, June 2010.

M. Ranzato and G. H. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR'2010, pages 2551–2558, 2010.

M. Ranzato, V. Mnih, and G. Hinton. Generating more realistic images using gated MRF's. In NIPS'2010, 2010.

M. Ranzato, J. Susskind, V. Mnih, and G. Hinton. On deep generative models with applications to recognition. In CVPR'2011, 2011.

R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In AISTATS'2009, volume 5, pages 448–455, 2009.

L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996.

J. Susskind, A. Anderson, and G. E. Hinton. The Toronto face dataset. Technical Report UTML TR 2010-001, U. Toronto, 2010.

J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283, 2000.

M. A. O. Vasilescu and D. Terzopoulos. Multilinear independent components analysis. In CVPR'2005, volume 1, pages 547–553, 2005.

H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference (BMVC), pages 127–127, London, UK, September 2009.