
Images as Bags of Pixels

Tony Jebara
Department of Computer Science, Columbia University
New York, NY 10027
jebara@cs.columbia.edu

Abstract

We propose modeling images and related visual objects as bags of pixels or sets of vectors. For instance, gray scale images are modeled as a collection or bag of (X, Y, I) pixel vectors. This representation implies a permutational invariance over the bag of pixels which is naturally handled by endowing each image with a permutation matrix. Each matrix permits the image to span a manifold of multiple configurations, capturing the vector set's invariance to orderings or permutation transformations. Permutation configurations are optimized while jointly modeling many images via maximum likelihood. The solution is a uniquely solvable convex program which computes correspondence simultaneously for all images (as opposed to traditional pairwise correspondence solutions). Maximum likelihood performs a nonlinear dimensionality reduction, choosing permutations that compact the permuted image vectors into a volumetrically minimal subspace. This is highly suitable for principal components analysis which, when applied to the permutationally invariant bag of pixels representation, outperforms PCA on appearance-based vectorization by orders of magnitude. Furthermore, the bag of pixels subspace benefits from automatic correspondence estimation, giving rise to meaningful linear variations such as morphings, translations, and jointly spatio-textural image transformations. Results are shown for several datasets.

1. Introduction

A vital component of any computer vision system is its choice of representation for images and visual data. The way visual information is parameterized, features are extracted, or images are mathematically described remains an active area of research. The success of subsequent vision modules for recognition, segmentation, tracking, and modeling often hinges on the initial representation we choose. For instance, if important variations in our image data generate linear or smooth changes under a given representation, a subsequent recognition module should perform significantly better. In this article we propose a bag of pixels or vector set representation for images. For example, a gray scale image can be considered as a collection of N pixels, each with spatial coordinates (X, Y) and an intensity coordinate (I). Thus, each image in our database is a bag of (X, Y, I) 3-tuples. Similarly, an edge or point-image can be seen as a bag of (X, Y) tuples with no intensity information. Even color video can be described as a vector set of (X, Y, R, G, B, time) 6-tuples. Figure 1(a) depicts this bag of pixels or collection of tuples representation. For comparison purposes, Figure 1(b) shows a traditional appearance-based vectorized representation where the tuples are rigidly concatenated along a fixed ordering.

Figure 1: A bag of pixels or vector set (a) versus a direct vectorization representation (b) for a gray scale image.
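To make the representation concrete, here is a small sketch (ours, not code from the paper) that turns a gray scale image into a bag of (X, Y, I) pixel tuples in Python/NumPy; the function name and the optional intensity threshold are illustrative assumptions.

```python
import numpy as np

def image_to_bag(image, intensity_threshold=0.0):
    """Convert a gray scale image (2D array) into a bag of (X, Y, I) pixel tuples.

    Returns an (N, 3) array whose row order is arbitrary -- the whole point of the
    representation is that any permutation of the rows describes the same image.
    """
    ys, xs = np.nonzero(image > intensity_threshold)   # pixel coordinates above threshold
    intensities = image[ys, xs]
    return np.stack([xs, ys, intensities], axis=1).astype(float)

# Example: a random 28x28 image becomes a bag of N 3-tuples.
img = np.random.rand(28, 28)
bag = image_to_bag(img, intensity_threshold=0.5)
print(bag.shape)   # (N, 3), with N depending on the threshold
```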

In a bag of pixels, it is important to maintain that there is no ordering on the pixels and these can be permuted arbitrarily. Concatenating the pixels into a single long vector would obscure this important source of invariance in our representation. Instead of assuming a single ordering or correspondence on the pixels in a bag, we will maintain that each bag of pixels can span a manifold of configurations in a vectorized Euclidean space and treat the ordering as an unknown yet estimable parameter in our modeling process. Figure 2 depicts a manifold representing a single image as a bag or set of vectors of N pixels, each of which is a D-dimensional tuple. This manifold is embedded in an ℜ^{N×D}-dimensional Euclidean vector space where each point on the manifold is given by concatenating the pixel tuples according to an arbitrary ordering. Admittedly, hard permutations of a vector do not form a continuous manifold. Instead, we approximate permutation with soft doubly-stochastic matrices which do in fact create a smooth and continuous manifold of configurations. In this toy example, we show four configurations of the same image under four different vector orderings. Each vector on the manifold is an arbitrarily ordered concatenation of the 4 pixel tuples. These configurations are invariant since all points on the manifold correspond to the same bag of pixels and render the same identical image. To parametrically consider arbitrary re-orderings, we endow each image or bag of pixels with its own unknown soft block-wise permutation matrix. Movement along each vector set or image's manifold can be represented by varying the unspecified permutation matrix to explore the arbitrary possible re-orderings of the vector set. Instead of proposing a fixed scheme (flow-based, appearance-based, physical deformation, etc.) for establishing the best correspondence or optimal setting of the permutation matrix, we fold this permutation estimation into our overall learning algorithm which jointly computes the permutations while fitting a statistical model to a dataset of many images (e.g. faces, hand-drawn digits, and so forth). Thus, while learning a model of multiple images (for instance a Gaussian or subspace model), we simultaneously estimate the optimal setting for the permutation matrices for each image such that we maximize the likelihood of the model fitting to the dataset. Essentially, the correspondence problem is simultaneously solved for each image in the whole dataset via a global criterion. Potential criteria we will consider include a Gaussian model fit or subspace model fit to a given database of images.

Figure 2: An image as a manifold of vector set configurations in an embedding Euclidean vector space.

We show that the above estimation problem is solvable as a convex program which yields a unique solution for the joint estimation of the model and representation. In other words, we jointly estimate a subspace model and the correspondence or permutation matrices for all images. The permutation matrices are described via a convex hull of constraints and we propose two convex cost functions for estimating them. The first is the maximum likelihood Gaussian mean estimator which gives rise to an estimate of permutation matrices that clusters images towards a common mean. The second is the maximum likelihood Gaussian covariance estimator which gives rise to an estimate of permutation matrices (or correspondences) that aligns images into a minimally low-dimensional subspace, which is an almost ideal pre-processing step for principal components analysis. We show update equations for solving the convex problem for the permutation matrices as an iterative quadratic program. Treating images in this manner provides a general method for solving correspondence and gives rise to more meaningful subspaces of variation for a given dataset. We show interesting experimental results on image datasets including faces and hand-drawn digit images. We note improved modeling, reconstruction, correspondence and representation of images as bags of pixels. For instance, subspace methods such as principal components show improved reconstruction accuracy (by orders of magnitude) when the optimal permutations are estimated instead of being assumed (or heuristically computed). Furthermore, we show various spatio-textural morphing bases emerging automatically after learning from image data. We then conclude with extensions and a summary. Optimized implementation code, additional results and further details are provided online at: www.cs.columbia.edu/~jebara.

2. Background

Other efforts in visual representations have explored similar treatments of images as a collection of (X, Y, I) points or (X, Y) points [15, 5]. However, one central issue plaguing this type of representation is the so-called correspondence problem [18, 1, 5]. For instance, it is not clear which tuple in bag A (representing image A) should match with a given tuple in bag B (representing image B). A variety of methods exist for establishing such correspondence, for example physics-based models [15, 18]. Another way to improve appearance models that directly vectorize images is to estimate and apply spatial alignment [14, 13, 4, 12]. Optical flow and variants similarly compute small local correspondences to recover jointly spatial and illumination subspace models [1, 8, 16, 2]. An interesting alternative that is similar to our approach is to consider direct optimization criteria for establishing correspondence, such as minimizing the covariance of the data [11] or its description length [3]. However, in most frameworks, correspondence is established a priori through some intermediate criterion, pre-specified physical constraints, pairwise matching or alignment optimization, instead of emerging automatically from invariances in the overall model learning process. We propose the latter scheme, estimating correspondences to agree with our modeling paradigm without resorting to prior or side constraints.

3. Bags of Pixels vs. Vectors

In this section we further motivate and compare the bag of pixels or vector set representation versus vectorization or appearance-based models. Correspondence issues and the topology of an image are discarded under direct lexicographic vectorization of the image. Vectorization merely scans the image in a fixed spatial pattern, concatenating each scalar intensity entry into a long vector and ignoring the spatial relationship between adjacent pixels. In this representation, natural variations in the image such as translation appear highly nonlinear. For example, translating a rasterized image vector x_0 by t pixels would involve multiplying it by a shifting matrix M taken to a power of t, i.e. x_t = M^t x_0. This is a highly nonlinear operation. In Figure 1(b), direct appearance-based vectorization naively organizes pixels into a fixed vector by assuming a fixed ordering and sacrifices our ability to see variation in spatial coordinates if, for example, we were to form an eigenspace over a database of such images. If we instead represent the image as a bag of pixels without specifying the ordering, we can maintain properties of the data such as spatial proximity between adjacent pixels and see variations in X and Y as well as I.
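The nonlinearity of translation under vectorization can be seen in a short toy sketch (ours, with made-up sizes): a one-pixel shift is a fixed permutation matrix M, and translating by t pixels requires the matrix power M^t, a different linear operator for every t.

```python
import numpy as np

def shift_matrix(n):
    """Permutation matrix M that shifts a length-n rasterized vector by one entry."""
    M = np.zeros((n, n))
    M[np.arange(1, n), np.arange(n - 1)] = 1.0   # row i takes its value from entry i-1
    return M

n = 8
x0 = np.random.rand(n)
M = shift_matrix(n)
t = 3
xt = np.linalg.matrix_power(M, t) @ x0   # x_t = M^t x_0: nonlinear in the shift amount t
```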

Figure 3: Morph properties of bag of pixels or vector sets. (a) Image; (b) vectorization basis; (c) bag of pixels basis.

One crucial property of the vector set or collection of tuples (i.e. pixels) is that the correspondence between images in the dataset is no longer specified and becomes implicit in the learning process. For example, the aforementioned naive ordering used in vectorization would make the X and Y components of the large vector redundant since these are constant from image to image. The only sources of variation are the I-intensity components. Meanwhile, the collection of pixels representation does not assume an ordering and, if we were to properly estimate correspondence, variations in the X and Y spatial coordinates could emerge. Consider applying a principal component analysis method to vectorized images as in the eigenfaces method [13]. Therein, a basis over the rasterized or vectorized representation would only involve additions and deletions of intensity components in the image. Therefore, a simple change like translation in the image appears highly non-linear. Alternatively, a basis over a collection of tuples where correspondence is optimally estimated allows variations in X and Y coordinates just as easily as variations in intensity. Thus, an image can morph or translate via a linear transformation in this representation. This process is depicted in Figure 3. In (a) we see a regular gray-scale image, which we can consider as a 3D surface with X-coordinates, Y-coordinates and I-intensity values. If vectorized in lexicographic order, the only basis of variation will be a vertical change in intensities as shown, for instance, in (b). In (c), however, we see that a basis of morphings in X and Y coordinates could also emerge if the vectorization is handled more elegantly. In this final figure, the vectors of flow indicate that we can translate the image equally well in X, Y or I merely via linear variations. Thus we note that it is suboptimal to assume an arbitrary ordering. We will instead compute the optimal correspondence or permutation matrix for each image while we perform our model estimation (i.e. estimating a PCA subspace or Gaussian model). One confounding aspect remains in our modeling problem, however: each permutation matrix is an unknown transformation parameter which moves the image along a path of invariance.
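By contrast, once the tuples are corresponded, a translation of the image is a purely linear move in the concatenated (X, Y, I) space. The small illustrative sketch below (ours, with arbitrary data) shows this: shifting the image by dx just adds dx times a fixed basis direction.

```python
import numpy as np

# bag: (N, 3) array of (X, Y, I) tuples with a fixed (corresponded) row order.
bag = np.random.rand(100, 3)

# A horizontal translation by dx is linear in the concatenated vector: it adds dx
# times a fixed basis direction (ones on the X coordinates, zeros elsewhere).
dx = 2.5
basis_direction = np.tile([1.0, 0.0, 0.0], len(bag))      # direction in R^(N*3)
translated = bag.flatten() + dx * basis_direction         # linear in dx
translated_bag = translated.reshape(-1, 3)
```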

4. Modeling while Permuting

We can view the unusual vector set representation as a problem of learning a model under permutational invariances of each image (here, each image is a data point in a given database we are attempting to model). A representation often implies invariance properties in an object. In the bag of pixel vectors, ordering of the objects in the bag should be invariant. So, transformations that permute the order of the tuples should not change the representation.

Consider estimating the optimal permutation or correspondence by using a simple example of a subspace learning problem, such as principal components analysis (PCA). In Figure 4(a) we see a data set in ℜ³ which needs to be modeled by a lower dimensional manifold. Clearly, no appropriate manifold is apparent and PCA will not provide a useful result. This is because we have assumed a fixed ordering for each image (each data point) instead of maintaining the images as a manifold of possible vectorized configurations. However, if we generalize PCA and add invariants to the data, this is no longer the case (see Figure 4(b)). For instance, the permutational invariance property of each image provides us with not only a single point for each image but a path or manifold¹ along which each image may move invariantly prior to applying PCA (i.e. a corresponded principal components analysis). If points can be moved along their permutation paths invariantly, they can more clearly form a compact two-dimensional subspace. Therefore, we will approach invariant model learning as estimation of permutation transformations on the data (i.e. paths for each datum) while simultaneously forming a model.

Figure 4: Invariant manifold learning.

5. A Convex Program

For the above estimation problem to have a unique globally optimal solution, we cast it as a jointly convex optimization over permutation matrices and model parameters. In fact, we will implicitly compute the optimal model parameters and fold them into a single cost function exclusively over permutation matrices. This process will be explicated in the next section, but for now, assume we have the following general scenario. We are given an input dataset of T vectors, X_1, ..., X_T. Each of these T vectors is of size N × D and results from the concatenation of the tuples in the bag of pixels according to some initial random ordering. We then endow each vector with an affine transformation matrix A_t that interacts linearly with it as follows:

∑_j A_t^{ij} X_t^j

These matrices then act to permute the entries in the given vector [7].

For our invariance, these A_t matrices are not just affine matrices but rather block-wise permutation matrices. However, optimizing over hard permutation settings is intractable (max-cut or integer programming methods might help resolve this). Instead, we relax the matrices such that they are soft permutation matrices.

¹A path is simply a 1-dimensional manifold.

Also, the matrices can only permute the D-dimensional tuples and not individual scalar entries in the X_t vectors. Therefore, each A_t matrix is a grid of many D × D identity matrices scaled by an unknown non-negative scalar A_t^{ij}. For example, a matrix permuting two different tuples has the following structure:

A_t = [ A_t^{11} I   A_t^{12} I
        A_t^{21} I   A_t^{22} I ]

To relax the hard permutation matrices, which only have binary entries, we only constrain each A_t matrix to be doubly-stochastic, in other words:

∑_i A_t^{ij} = 1,    ∑_j A_t^{ij} = 1,    A_t^{ij} ≥ 0
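As an illustration of these constraints (our sketch, not the paper's solver, which uses the SMO-style updates of Section 7 and the Invisible Hand algorithm [10] for efficient storage), a soft doubly-stochastic matrix over tuples can be produced by alternating row and column normalizations, and its block-wise form is the Kronecker product with a D × D identity:

```python
import numpy as np

def sinkhorn(P, n_iters=50):
    """Push a positive N x N matrix toward the doubly-stochastic polytope by
    alternately normalizing rows and columns (illustrative; not the paper's method)."""
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)   # rows sum to 1
        P = P / P.sum(axis=0, keepdims=True)   # columns sum to 1
    return P

N, D = 5, 3
P = sinkhorn(np.random.rand(N, N) + 1e-3)       # soft permutation over the N tuples
A_t = np.kron(P, np.eye(D))                     # block-wise: permutes D-tuples, not scalars

X_t = np.random.rand(N * D)                     # concatenated bag of N D-dimensional tuples
permuted = A_t @ X_t                            # a convex re-mixing of the tuples
```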

Thus, we can re-sort the tuples in each X_t vector as well as take convex combinations of them. In fact, the A_t matrices need not be square, but can be of size ND × N_t D, where N_t is the number of tuples in each X_t image vector. This is useful if we want the correspondence estimation to combine or mix two or more pixels in differently sized images so that all images map into a common ℜ^{ND} embedding space for PCA or our Gaussian model. We denote these many constrained transformation matrices as A = A_1, ..., A_T and note that they satisfy a set of linear equality and inequality constraints. Thus, our solution space over the set of matrices is a convex hull. The convex hull is a requirement of any convex programming framework, as shown below:

min_A C(A)   subject to   ∑_{ij} A_t^{ij} Q_{td}^{ij} + b_{td} ≥ 0   ∀ t, d     (1)

Here, C(A) is a convex cost function to be minimized and the Q_{td} and b_{td} constants specify a hull of constraints (such as our doubly-stochastic constraints). Once we specify C(A), the above formulation is solvable via convex programming techniques, including dual and axis-parallel methods, which all yield a global solution. The general optimization picture that emerges is shown in Figure 5.

Figure 5: Convex program over doubly-stochastic matrices.


We now propose possible choices for the convex cost function C(A) over the A matrices. These emerge automatically from traditional maximum likelihood modeling criteria (such as Gaussian modeling or subspace modeling).

6. Maximum Likelihood Criteria

We now consider two model estimation criteria (although others are possible) and see how they give rise to a convex cost function over the permutation matrices. For jointly performing model estimation while learning invariances, consider the simplest case of estimating a Gaussian mean by maximizing likelihood of the data while we explore different permutation settings. The log-likelihood is l(A, µ) = ∑_t log N(A_t X_t; µ, I). The optimal mean is µ̂ = (1/T) ∑_t A_t X_t, which we can plug back into the log-likelihood expression to get:

l(A, µ̂) = −(TD/2) log(2π) − (1/2) ∑_t ‖A_t X_t − µ̂‖²

The above likelihood can now be further maximized over permutations. By negating, we convert the likelihood into a cost function to minimize over A. With further manipulations, the cost function that emerges essentially minimizes the trace of the covariance of the permuted image data:

C(A) = tr(Cov(A X))

The trace of the covariance is a convex quadratic function over the A_t matrices. Combined with linear constraints that make each A_t doubly-stochastic, this cost is minimizable via quadratic programming or iterative methods. This Gaussian mean criterion thus tends to select permutation matrices that cluster data spherically, by moving images along their paths to center data towards a common mean.
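For concreteness, the Gaussian mean criterion C(A) = tr(Cov(A X)) can be evaluated directly. The sketch below (ours, with toy data and identity permutations) simply computes the trace of the covariance of the permuted image vectors that the quadratic program would then decrease.

```python
import numpy as np

def mean_criterion(A_list, X_list):
    """C(A) = tr(Cov(A X)): trace of the covariance of the permuted image vectors."""
    permuted = np.stack([A @ X for A, X in zip(A_list, X_list)])   # T x (N*D)
    cov = np.cov(permuted, rowvar=False)                           # (N*D) x (N*D)
    return np.trace(cov)

# T images, each a concatenated bag of N D-dimensional tuples, with identity "permutations".
T, N, D = 10, 5, 3
X_list = [np.random.rand(N * D) for _ in range(T)]
A_list = [np.eye(N * D) for _ in range(T)]
print(mean_criterion(A_list, X_list))
```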

We generalize to Gaussians of variable covariance as in N(AX; µ, Σ) and also plug the maximum likelihood covariance estimate Σ̂ = (1/T) ∑_t (A_t X_t − µ̂)(A_t X_t − µ̂)^T into the likelihood function to obtain:

l(A, µ̂, Σ̂) = −(TD/2) log(2π) − (T/2) log|Σ̂| − (1/2) ∑_t (A_t X_t − µ̂)^T Σ̂^{-1} (A_t X_t − µ̂)

After simplifications, the maximum likelihood solution for A is equivalent to minimizing the cost function |Cov(A X)|.² We will instead minimize the logarithm of the above cost, i.e. C(A) = log|Cov(A X)|, since it shares the same optima. We also regularize the cost function by adding a small identity matrix to the covariance and adding a small tr(Cov(A X)) term (as in the Gaussian mean case); this avoids rank degeneracies and guarantees the cost stays convex. More specifically, we have:

²Both the trace and determinant were discussed by [11], yet were not derived via maximum likelihood or convexified into a uniquely solvable program over doubly-stochastic matrices.

C(A) = log |Cov(A X) + ε1I| + ε2tr(Cov(A X))

Both ε1 and ε2 are kept small (≈ 1.0). We prove convexity in the Appendix. Therefore our new C(A) is again convex and Equation 1 results in a convex program. However, it is not a quadratic program. We can instead minimize C(A) by iteratively upper bounding it with a quadratic function in A. This permits us to sequentially solve multiple quadratic programs interleaved with variational bounding steps until we converge to the global solution. First consider log|S| where we have defined S = Cov(A X) + ε1I. The logarithm of the determinant is concave over covariance matrices [6]. Since log|S| is concave, we can upper bound it with a tangential linear function in S that is equal and has the same gradient R = S_0^{-1} at the current setting of S = S_0, which is computed from our current setting of the permutation matrices, A = A_0. The upper bound is then:

log|S| ≤ tr(RS) + log|S_0| − tr(RS_0)

Adding our additional regularizer term with ε2 to the above, we obtain the following upper bound on C(A):

C(A) ≤ tr(R(Cov(A X) + ε1I)) + log|S_0| − tr(RS_0) + ε2 tr(Cov(A X))

Simplifying the bound by removing terms that are constant over A, we have the following surrogate cost to minimize:

C̃(A) = tr(M Cov(A X))   where   M = (Cov(A X) + ε1I)^{-1} + ε2I

We thus update M for the current setting of the A_t matrices (by computing the covariance of the data after each A_t is applied to each X_t), then lock it for a few iterations while we minimize the trace to update the A permutation parameters. Updates of M are interleaved with updates of the A matrices until convergence. The above criterion attempts to cluster data ellipsoidally such that it forms a low-dimensional sub-manifold. It is well known that the determinant of a covariance matrix behaves like a volumetric estimator and approximates the volume of the data. Minimizing volume by varying the permutation matrices (i.e. computing the correspondence) is a valuable preprocessing


step for PCA since it concentrates signal energy into a smaller number of eigenvalues, improving the effectiveness and reconstruction accuracy in the PCA subspace. Therefore, this criterion attempts to flatten the data via permutation such that it forms as flat and low-dimensional a subspace as possible.
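The alternation described above can be sketched schematically as follows (our outline, not the paper's optimized code); the inner step that actually decreases tr(M Cov(A X)) over doubly-stochastic A_t is left as a placeholder for the SMO-style updates of Section 7.

```python
import numpy as np

def covariance(A_list, X_list):
    permuted = np.stack([A @ X for A, X in zip(A_list, X_list)])
    return np.cov(permuted, rowvar=False)

def minimize_trace_surrogate(A_list, X_list, M):
    # Placeholder: the paper minimizes tr(M Cov(A X)) over doubly-stochastic A_t with
    # the SMO-style axis updates of Section 7; here we simply return A unchanged.
    return A_list

def fit_permutations(A_list, X_list, eps1=1.0, eps2=1.0, n_outer=10):
    """Outer loop: recompute the bound matrix M, then decrease the trace surrogate."""
    dim = len(X_list[0])
    for _ in range(n_outer):
        S0 = covariance(A_list, X_list) + eps1 * np.eye(dim)
        M = np.linalg.inv(S0) + eps2 * np.eye(dim)          # surrogate is tr(M Cov(A X))
        A_list = minimize_trace_surrogate(A_list, X_list, M)
    return A_list

# Toy usage with identity permutations as the starting point.
T, N, D = 10, 5, 3
X_list = [np.random.rand(N * D) for _ in range(T)]
A_list = [np.eye(N * D) for _ in range(T)]
A_list = fit_permutations(A_list, X_list)
```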

7. Implementation

The cost functions so far both involve minimizing the trace of a matrix in the following general form (M = I for the Gaussian mean case, while in the Gaussian covariance case, M is periodically recomputed from the covariance of the data as it is being transformed):

tr(M Cov(A X)) = (1/T) ∑_{mpnqi} A_i^{mn} A_i^{pq} X_i^q M_{pm} X_i^n − (1/T²) ∑_{mpnqij} A_i^{mn} A_j^{pq} X_j^q M_{pm} X_i^n

Degeneracies may arise since we approximate permutations using doubly-stochastic matrices. For instance, A_t's entries might all become a constant c = 1/N, averaging out all tuples. To discourage this, we add a quadratic penalty to the cost as −λ ∑_{imn} (A_i^{mn} − c)². This penalizes matrix entries close to the mean and favors entries near 0 or 1. The λ is chosen adaptively to maintain convexity.³

To minimize the cost with constraints, we use an SMO approach [17], although an SVD-like update rule for each A_t is also possible. As in SMO, we vary a single A_t matrix for the t'th image at a time and only update 4 of its entries (A_t^{mn}, A_t^{mq}, A_t^{pn}, A_t^{pq}) while all others are locked. Iterating randomly, we ultimately update all scalar entries in each A_t matrix; however, only 4 scalars are updated at a time in a given iteration. Double-stochasticity gives the equality constraints A_t^{mn} + A_t^{mq} = a, A_t^{pn} + A_t^{pq} = b, A_t^{mn} + A_t^{pn} = c and A_t^{mq} + A_t^{pq} = d. So only one degree of freedom is left to compute per iteration, and it is updated as in Figure 6. The operations involve computing all possible inner products between the X-tuples X_n and X_q weighted by all relevant M_mm, M_mp, M_pp and M_pm sub-matrices.⁴ We compute the H1, H2, H3, H4 and NUM, DEN terms. The ratio of NUM over 2 DEN gives the optimal update value for the matrix entry A_t^{mn}. We then limit A_t^{mn} to satisfy the inequalities A_t^{mn} ∈ [max(0, a − d, c − 1), min(a, c, 1 + a − d)]. After updating A_t^{mn}, we can update the other 3 entries via their linear dependence on A_t^{mn}. Iterating the update rule randomly over different entries and matrices while intermittently recomputing bounds (via the inverse of M) converges monotonically to the global minimum of C(A).

³Other simplifications are possible. For instance, when minimizing the determinant, we can force the Gaussian mean to be equal to a single random image in our dataset, i.e. µ = X_i for the i'th image, and also lock its corresponding permutation matrix to identity, i.e. A_i = I.

⁴For clarity, here the matrices and vectors are indexed with subscripts instead of superscripts and all entries where the datum index does not appear refer to the datum at the t'th index.

If a new test image X_{T+1} is observed after training and we wish to compute its A_{T+1} matrix, we could re-optimize the full cost C(A_1 ... A_{T+1}) again. A more efficient approach is to fix previous estimates of the matrices A_t for t = 1, ..., T and just optimize C(A) for the new A_{T+1}. A further simplification is to apply PCA to the covariance matrix since the training data is now flattened into a subspace. We maintain a reasonable number of k eigenvectors and align a new image to this eigenspace by choosing its permutation matrix A_{T+1} to minimize the squared error of the reconstruction in the subspace. If the eigenvectors are V_1, ..., V_k, we thus minimize the following quadratic cost subject to double-stochasticity constraints on A_{T+1}:

A_{T+1} = arg min_A ‖ ∑_k (V_k^T A X_{T+1}) V_k − A X_{T+1} ‖²
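The quantity being minimized here can be sketched as follows (our code with hypothetical names; the constrained search over doubly-stochastic A_{T+1} is omitted): given orthonormal eigenvectors V and a candidate A, compute the squared subspace reconstruction error of A X_{T+1}.

```python
import numpy as np

def reconstruction_error(A, X_new, V):
    """Squared error || sum_k (V_k^T A X_new) V_k - A X_new ||^2
    for eigenvectors V (columns) and a candidate soft permutation A."""
    z = A @ X_new
    projection = V @ (V.T @ z)     # reconstruction of the permuted image in the subspace
    return float(np.sum((projection - z) ** 2))

# Toy usage with k = 20 stand-in eigenvectors of an (N*D)-dimensional space.
dim, k = 60, 20
V = np.linalg.qr(np.random.rand(dim, k))[0]    # orthonormal columns standing in for PCA eigenvectors
X_new = np.random.rand(dim)
A = np.eye(dim)                                 # candidate permutation (identity here)
print(reconstruction_error(A, X_new, V))
```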

8. Experimental Results

We evaluated the framework on three datasets: (X, Y) point images of digits, (X, Y, I) intensity images of a single face, and (X, Y, I) intensity images of multiple individuals. In the first dataset, we obtained 28 × 28 gray scale images of the digits 3 and 9 which were then represented as a collection of 70 (X, Y) pixels by sampling the region of high intensity. This generates clouds of 2D points in the shape of 3's or 9's. A total of 20 such point images was collected with 70 (X, Y) pixels each. We then estimated the A_t permutation matrices by minimizing the covariance's determinant. Figure 7(a) depicts 6 exemplars of the original image data as point clouds.

Figure 7: Reconstruction of digit images via PCA with and without permutation estimation. (a) Original data; (b) direct PCA; (c) permuted PCA.


A_t^{mn} ← NUM / (2 DEN)

NUM = c X_n^T M_mp X_n + c X_n^T M_pp X_q + 2a X_q^T M_pm X_q + c X_n^T M_pm X_n + a X_n^T M_mm X_q − 2a X_q^T M_mm X_q − 2c X_n^T M_pp X_n − a X_n^T M_mp X_q − c X_n^T M_pm X_q − a X_n^T M_pm X_q − X_q^T M_mp X_q d − X_n^T M_pp X_q d + X_n^T M_pp X_q a + X_n^T M_mp X_q d − X_q^T M_pm X_q d + 2 X_q^T M_pp X_q d − 2 X_q^T M_pp X_q a + 2a X_q^T M_mp X_q + H1 − H3 + H4 − H2 − c X_q^T M_mp X_n − a X_q^T M_mp X_n − a X_q^T M_pm X_n + a X_q^T M_mm X_n + c X_q^T M_pp X_n + X_q^T M_pm X_n d − X_q^T M_pp X_n d + X_q^T M_pp X_n a + 4aλ + 2cλ − 2dλ

DEN = X_q^T M_mp X_q − X_n^T M_mm X_n − X_q^T M_mm X_q − X_n^T M_pp X_n + X_n^T M_pm X_n − X_q^T M_pp X_q + X_n^T M_mm X_q − X_n^T M_pm X_q + X_n^T M_mp X_n − X_n^T M_mp X_q + X_n^T M_pp X_q + X_q^T M_pm X_q − X_q^T M_pm X_n − X_q^T M_mp X_n + X_q^T M_pp X_n + X_q^T M_mm X_n + 4λ

H1 = (X_n^T M_um^T + X_n^T M_mu) (∑_{u,v ∉ {mn,mq,pn,pq}} A^{uv} X_v − (1/(T−1)) ∑_{u,v,τ≠t} A_τ^{uv} X_{τ,v})

H2 = (X_n^T M_up^T + X_n^T M_pu) (∑_{u,v ∉ {mn,mq,pn,pq}} A^{uv} X_v − (1/(T−1)) ∑_{u,v,τ≠t} A_τ^{uv} X_{τ,v})

H3 = (X_q^T M_um^T + X_q^T M_mu) (∑_{u,v ∉ {mn,mq,pn,pq}} A^{uv} X_v − (1/(T−1)) ∑_{u,v,τ≠t} A_τ^{uv} X_{τ,v})

H4 = (X_q^T M_up^T + X_q^T M_pu) (∑_{u,v ∉ {mn,mq,pn,pq}} A^{uv} X_v − (1/(T−1)) ∑_{u,v,τ≠t} A_τ^{uv} X_{τ,v})

Figure 6: Update rule for iteratively adjusting entries of the permutation matrices.

In Figure 7(b), standard PCA with 10 eigenvectors reconstructed the digit point clouds poorly. In Figure 7(c), PCA with 10 eigenvectors was instead applied to the bag of pixels after estimation of the permutation matrices. Note that the images are reconstructed more faithfully in the latter case due to the estimation of the permutation or correspondence. Unlike the direct PCA eigenvectors, which do not resolve correspondence, the eigenvectors after permutation seem to translate and morph the digits smoothly as in Figure 8(a). Also, the first eigenvectors have more interpretable modes of variation. When applied to the number 9, the first eigenvectors seem to morph the point cloud in interesting ways as in Figure 8(b) and (c).

Figure 8: Linear interpolation (morphing) and effect of the first eigenvectors under a bag of pixels representation. (a) Interpolation; (b), (c) eigenvectors.

In a larger experiment, we obtained T = 300 gray scale images of faces and sampled the pixels in skin-colored regions to obtain a collection of N = 2000 (X, Y, I) pixels for each face. The face images were of a single individual's face as it spans many lighting, 3D pose and expression configurations. Due to the larger size of this dataset, we avoided explicitly storing the A_t matrices, which are of size O(N²). A more efficient storage method for doubly-stochastic matrices can be used where each is saved as 2N scalars and estimated with the Invisible Hand algorithm [10]. On this dataset, we estimated the permutation transformation matrices and we found an eigenspace of 20 components. We then reconstructed the images from 20 coefficients alone.

Figure 9: Reconstruction of (X, Y, I) facial images with vectorized PCA or PCA on permutable bags of pixels. (a) Original data; (b) PCA; (c) permuted PCA.

Figure 9 depicts the accuracy of the reconstructed faces when standard PCA (with 20 eigenvectors) was used as well as when PCA for bags of pixels was used, which performs permutation estimation. The images show much higher fidelity when permutation or correspondence is optimized. The permuted (X, Y, I) vector-set eigenvectors act smoothly, rotating and morphing the face in 3D as well as changing its illumination. Traditional appearance-based PCA causes ghosting effects where translated images do not move smoothly but, instead, appear to fade in and out. More interestingly, in terms of squared error, PCA had a reconstruction error of 9e5 while permuted PCA had a reconstruction error of 5e3. The novel method reduces squared error by more than two orders of magnitude.

Finally, a large dataset of 3D synthesized faces of many individuals was used. Permutations were estimated in a semi-supervised way for efficiency and then PCA was performed on the permuted pixels. We show the first few eigenvectors as ± displacements from the mean face in Figure 10. Each row is one eigenvector (the top row is the top eigenvector) while columns show varying degrees of additions and deletions of the eigenvector. Note how the eigenvectors correspond to natural out-of-plane and in-plane rotations, as well as joint morphings and intensity variations. In current work, we are building a bag of pixels face tracker based on these eigenvectors.

Figure 10: Top 5 bag of pixels eigenvectors applied to the mean from multi-person (X, Y, I) face images.

9. Summary and Conclusions

We explored permutation invariance for dealing with images that are organized into collections of tuples, vector sets or bags of pixels. Statistical modeling (Gaussian mean, covariance, appearance subspaces) was shown to clearly benefit from the simultaneous estimation of per-image permutation transformations as well as model parameters. Image data points are effectively permuted and aligned simultaneously during the modeling process. They are also aligned as a whole dataset instead of via pair-wise or ad hoc criteria. We note drastically improved reconstruction accuracy as well as more meaningful linear bases of variation (morphings, spatio-textural change, etc.). Currently, we are exploring a kernel or distance K(χ, χ) − 2K(χ, ν) + K(ν, ν) between bags of pixels which corresponds to finding the closest distance between two permutation manifolds. Surprisingly, this can be done implicitly and efficiently without explicitly computing the optimal permutations [9].

Acknowledgments

Thanks to the reviewers and N. Jojic for valuable feedback. This work was supported in part by NSF ITR 0312690.

Appendix

Theorem. The function C(X) = log|Cov(X) + ε1I| + ε2 tr(Cov(X)) of the vector dataset X = {x_1, ..., x_N} is convex in the data (and also in variables such as A_n that linearly interact with the data) when ε1, ε2 ≥ 1.

Proof. Without loss of generality, assume the vectors in X are zero-mean, which yields Cov(X) = ∑_n x_n x_n^T when we ignore scaling by 1/N. The function then simplifies to:

C(X) = log|∑_n x_n x_n^T + ε1I| + ε2 tr(∑_n x_n x_n^T)

Compute the Hessian of C(X) over the single large vector argument x̃ formed by the concatenation of all the x_1, ..., x_N. The desired (1/2)C″(x̃) is then:

|Cov(X) + ε1I|^{-1} I + ε2I − 2|Cov(X) + ε1I|^{-2} x̃x̃^T

Here I is a large identity matrix the size of x̃x̃^T. For convexity, we show the Hessian or a lower bound on it remains positive definite. Note −x̃x̃^T is lower bounded by −(x̃^T x̃)I in the Loewner ordering sense. Also, note x̃^T x̃ = tr(Cov(X)). Thus, the Hessian is lower bounded by the following scalar times I:

|Cov(X) + ε1I|^{-1} + ε2 − 2|Cov(X) + ε1I|^{-2} tr(Cov(X))

Rearranging and writing traces and determinants in terms of the (arbitrary yet non-negative) eigenvalues λ_1, ..., λ_D of Cov(X), we get the following non-negativity requirement:

(1/2) ε2 ∏_d (λ_d + ε1)² + (1/2) ∏_d (λ_d + ε1) − ∑_d λ_d ≥ 0

In the unidimensional case, there is one eigenvalue λ_1 and the above is a quadratic which remains positive whenever ε1ε2 ≥ 1/8. In the multidimensional case, observe the gradient ∂/∂λ_i of the left hand side of the above inequality:

(1/2) ε2 ∏_{d≠i} (λ_d + ε1)² (λ_i + ε1) + (1/2) ∏_{d≠i} (λ_d + ε1) − 1

If ε1 ≥ 1 and ε2 ≥ 1 the gradient stays positive. Thus, to minimize the left hand side of the bound, eigenvalues can move against the gradient and shrink to 0. However, the inequality is still satisfied at their lowest setting λ = 0, where the left hand side attains its non-negative minimum, ensuring the function C(X) is convex for ε1 ≥ 1 and ε2 ≥ 1.


References

[1] D. Beymer and T. Poggio. Image representation for visual learning. Science, 272, 1996.

[2] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.

[3] R. Davies, C. Twining, T. Cootes, J. Waterton, and C. Taylor. 3D statistical shape models using direct optimisation of description length. In European Conference on Computer Vision, 2002.

[4] B. Frey and N. Jojic. Estimating mixture models of images and inferring spatial transformations using the EM algorithm. In Computer Vision and Pattern Recognition, 1999.

[5] S. Gold, C.P. Lu, A. Rangarajan, S. Pappu, and E. Mjolsness. New algorithms for 2D and 3D point matching: Pose estimation and correspondence. In Neural Information Processing Systems 7, 1995.

[6] D. Jakobson and I. Rivin. Extremal metrics on graphs. Forum Math, 14(1), 2002.

[7] T. Jebara. Convex invariance learning. In Artificial Intelligence and Statistics 9, 2003.

[8] M. Jones and T. Poggio. Hierarchical morphable models. In Computer Vision and Pattern Recognition, 1998.

[9] R. Kondor and T. Jebara. A kernel between sets of vectors. In Machine Learning: Tenth International Conference, 2003.

[10] J. Kosowsky and A. Yuille. The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 7:477–490, 1994.

[11] A.C.W. Kotcheff and C.J. Taylor. Automatic construction of eigenshape models by direct optimisation. Medical Image Analysis, 2(4):303–314, 1998.

[12] E. Miller, N. Matsakis, and P. Viola. Learning from one example through shared densities on transforms. In Computer Vision and Pattern Recognition, 2000.

[13] B. Moghaddam and A. Pentland. Probabilistic visual learning for object detection. In International Conference on Computer Vision, 1995.

[14] H. Murase and S. Nayar. Learning object models from appearance. In Proceedings of the AAAI, 1993.

[15] C. Nastar, B. Moghaddam, and A. Pentland. Generalized image matching: Statistical learning of physically-based deformations. In European Conference on Computer Vision, 1996.

[16] A. O'Toole, T. Vetter, and V. Blanz. Three-dimensional shape and two-dimensional surface reflectance contributions to face recognition: An application of three-dimensional morphing. Vision Research, 39, 1999.

[17] J. Platt. Using analytic QP and sparseness to speed training of support vector machines. In Neural Information Processing Systems 11, 1999.

[18] S. Sclaroff and A. Pentland. Modal matching for correspondence and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6), 1993.