

Multiview Alignment Hashing for Efficient Image Search

Li Liu, Mengyang Yu, Student Member, IEEE, and Ling Shao, Senior Member, IEEE

Abstract— Hashing is a popular and efficient method for nearest neighbor search in large-scale data spaces by embedding high-dimensional feature descriptors into a similarity-preserving Hamming space with a low dimension. For most hashing methods, the performance of retrieval heavily depends on the choice of the high-dimensional feature descriptor. Furthermore, a single type of feature cannot be descriptive enough for different images when it is used for hashing. Thus, how to combine multiple representations for learning effective hashing functions is an imminent task. In this paper, we present a novel unsupervised multiview alignment hashing approach based on regularized kernel nonnegative matrix factorization, which can find a compact representation uncovering the hidden semantics and simultaneously respecting the joint probability distribution of data. In particular, we aim to seek a matrix factorization to effectively fuse the multiple information sources meanwhile discarding the feature redundancy. Since the raised problem is regarded as nonconvex and discrete, our objective function is then optimized via an alternate way with relaxation and converges to a locally optimal solution. After finding the low-dimensional representation, the hashing functions are finally obtained through multivariable logistic regression. The proposed method is systematically evaluated on three data sets: 1) Caltech-256; 2) CIFAR-10; and 3) CIFAR-20, and the results show that our method significantly outperforms the state-of-the-art multiview hashing techniques.

Index Terms— Hashing, multiview, NMF, alternate optimization, logistic regression, image similarity search.

I. INTRODUCTION

LEARNING discriminative embedding has been a critical problem in many fields of information processing and analysis, such as object recognition [1], [2], image/video retrieval [3], [4] and visual detection [5]. Among them, scalable retrieval of similar visual information is attractive, since with the advances of computer technologies and the development of the World Wide Web, a huge amount of digital data has been generated and applied. The most basic but essential scheme for similarity search is the nearest neighbor (NN) search: given a query image, find an image that is most similar to it within a large database and assign the same label of the nearest neighbor to this query image.

Manuscript received May 25, 2014; revised January 5, 2015; accepted January 5, 2015. Date of publication January 12, 2015; date of current version January 30, 2015. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jie Liang. (Corresponding author: Ling Shao)

The authors are with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne NE1 8ST, U.K. (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2015.2390975

NN search is regarded as a linear search scheme (O(N)), which is not scalable due to the large sample size in datasets of practical applications. Later, to overcome this kind of computational complexity problem, some tree-based search schemes were proposed to partition the data space via various tree structures. Among them, the KD-tree and R-tree [6] are successfully applied to index the data for fast query responses. However, these methods cannot operate with high-dimensional data and do not guarantee faster search compared to the linear scan. In fact, most vision-based tasks suffer from the curse of dimensionality,¹ because visual descriptors usually have hundreds or even thousands of dimensions. Thus, some hashing schemes are proposed to effectively embed data from a high-dimensional feature space into a similarity-preserving low-dimensional Hamming space where an approximate nearest neighbor of a given query can be found with sub-linear time complexity.

One of the most well-known hashing techniques that preserve similarity information is Locality-Sensitive Hashing (LSH) [7]. LSH simply employs random linear projections (followed by random thresholding) to map data points close in a Euclidean space to similar codes. Spectral Hashing (SpH) [8] is a representative unsupervised hashing method, in which the Laplace-Beltrami eigenfunctions of manifolds are used to determine binary codes. Moreover, principled linear projections like PCA Hashing (PCAH) [9] have been suggested for better quantization rather than random projection hashing. Besides, another popular hashing approach, Anchor Graphs Hashing (AGH) [10], is proposed to learn compact binary codes via tractable low-rank adjacency matrices. AGH allows constant time hashing of a new data point by extrapolating graph Laplacian eigenvectors to eigenfunctions. More relevant hashing methods can be found in [11]–[13].

However, single-view hashing is the main topic on which the previous exploration of hashing methods focuses. In their architectures, only one type of feature descriptor is used for learning hashing functions. In practice, to make a more comprehensive description, objects/images are always represented via several different kinds of features, each of which has its own characteristics. Thus, it is desirable to incorporate these heterogeneous feature descriptors into learning hashing functions, leading to multi-view hashing approaches. Multiview learning techniques [14]–[17] have been well explored in the past few years and widely applied to visual information fusion [18]–[20]. Recently, a number of multiview hashing methods have been proposed for efficient similarity search, such as Multi-View Anchor Graph Hashing (MVAGH) [21], Sequential Update for Multi-View Spectral Hashing (SU-MVSH) [22], Multi-View Hashing (MVH-CS) [23], Composite Hashing with Multiple Information Sources (CHMIS) [24] and Deep Multi-view Hashing (DMVH) [25]. These methods mainly depend on spectral, graph or deep learning techniques to achieve data-structure-preserving encoding. Nevertheless, hashing purely with the above schemes is usually sensitive to data noise and suffers from high computational complexity.

¹The effectiveness and efficiency of these methods drop exponentially as the dimensionality increases, which is commonly referred to as the curse of dimensionality.

The above drawbacks of prior work motivate us to propose a novel unsupervised multiview hashing approach, termed Multiview Alignment Hashing (MAH), which can effectively fuse multiple information sources and exploit the discriminative low-dimensional embedding via Nonnegative Matrix Factorization (NMF) [26]. NMF is a popular method in data mining tasks including clustering, collaborative filtering, outlier detection, etc. Unlike other embedding methods with positive and negative values, NMF seeks to learn a nonnegative parts-based representation that gives better visual interpretation of factoring matrices for high-dimensional data. Therefore, in many cases, NMF may be more suitable for subspace learning tasks, because it provides a non-global basis set which intuitively contains the localized parts of objects [26]. In addition, since the flexibility of matrix factorization can handle widely varying data distributions, NMF enables more robust subspace learning. More importantly, NMF decomposes an original matrix into a parts-based representation that gives better interpretation of factoring matrices for nonnegative data. When applying NMF to multiview fusion tasks, a parts-based representation can reduce the corruption between any two views and gain more discriminative codes.

To the best of our knowledge, this is the first work using NMF to combine multiple views for image hashing. It is worthwhile to highlight several contributions of the proposed method:

• MAH can find a compact representation uncovering the hidden semantics from different view aspects and simultaneously respecting the joint probability distribution of data.

• To solve our nonconvex objective function, a new alternate optimization has been proposed to obtain the final solution.

• We utilize multivariable logistic regression to generate the hashing function and achieve the out-of-sample extension.

The rest of this paper is organized as follows. In Section II, we give a brief review of NMF. The details of our method are described in Section III. Section IV reports the experimental results. Finally, we conclude this paper in Section V.

II. A BRIEF REVIEW OF NMF

In this section, we mainly review some related algorithms, focusing on Nonnegative Matrix Factorization (NMF) and its variants. NMF is proposed to learn the nonnegative parts of objects. Given a nonnegative data matrix X = [x_1, . . . , x_N] ∈ R^{D×N}_{≥0}, each column of X is a sample. NMF aims to find two nonnegative matrices U ∈ R^{D×d}_{≥0} and V ∈ R^{d×N}_{≥0} with full rank whose product can approximately represent the original matrix X, i.e., X ≈ UV. In practice, we always have d < min(D, N). Thus, we minimize the following objective function:

L_NMF = ‖X − UV‖², s.t. U, V ≥ 0,   (1)

where ‖·‖ is the Frobenius norm. To optimize the above objective function, an iterative updating procedure was developed in [26] as follows:

V_ij ← [(UᵀX)_ij / (UᵀUV)_ij] V_ij,   U_ij ← [(XVᵀ)_ij / (UVVᵀ)_ij] U_ij,   (2)

and normalization

U_ij ← U_ij / Σ_i U_ij.   (3)

It has been proved that the above updating procedure can find a local minimum of L_NMF. The matrix V obtained by NMF is always regarded as the low-dimensional representation, while the matrix U denotes the basis matrix.
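For concreteness, the following is a minimal NumPy sketch of the multiplicative updates of Eqs. (2)-(3). The per-iteration column normalization of U also rescales V so that the product UV is unchanged, which is a common implementation choice rather than something stated in the text, and the small eps only guards against division by zero.

```python
import numpy as np

def nmf(X, d, n_iter=200, eps=1e-10):
    """Standard NMF via the multiplicative updates of Eqs. (2)-(3).

    X : (D, N) nonnegative data matrix, one sample per column.
    d : target dimensionality, d < min(D, N).
    Returns U (D, d) and V (d, N) with X approximately equal to U V.
    """
    D, N = X.shape
    rng = np.random.default_rng(0)
    U = rng.random((D, d))
    V = rng.random((d, N))
    for _ in range(n_iter):
        # Eq. (2): multiplicative updates keep U and V nonnegative.
        V *= (U.T @ X) / (U.T @ U @ V + eps)
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        # Eq. (3): normalize the columns of U; rescale V so that U V is unchanged.
        s = U.sum(axis=0, keepdims=True) + eps
        U /= s
        V *= s.T
    return U, V
```

For instance, U, V = nmf(X, 32) would yield a 32-dimensional representation V of the columns of X.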

Furthermore, there also exist some variants of NMF. Local NMF (LNMF) [27] imposes a spatially localized constraint on the bases. In [28], sparse NMF was proposed and, later, NMF constrained with neighborhood preserving regularization (NPNMF) [29] was developed. Besides, researchers also proposed graph regularized NMF (GNMF) [30], which effectively preserves the locality structure of data. Beyond these methods, [31] extends the original NMF with the kernel trick as kernelized NMF (KNMF), which can extract more useful features hidden in the original data through kernel-induced nonlinear mappings. Moreover, it can deal with data where only relationships (similarities or dissimilarities) between objects are known. Specifically, in their work, they addressed the matrix factorization by K ≈ UV, where K is the kernel matrix instead of the data matrix X, and (U, V) are similar to those in standard NMF. More related to our work, a multiple kernel NMF (MKNMF) [32] approach was proposed, where linear programming is applied to determine the combination of different kernels.

In this paper, we present a Regularized Kernel Nonnegative Matrix Factorization (RKNMF) framework for hashing, which can effectively preserve the intrinsic probability distribution of the data and simultaneously reduce the redundancy of low-dimensional representations. Rather than locality-based graph regularization, we measure the joint probability of pairwise data by the Gaussian function, which is defined over all the potential neighbors and has been proved to effectively resist data noise [33]. This kind of measurement is capable of capturing the local structure of the high-dimensional data while also revealing global structure such as the presence of clusters at several scales. To the best of our knowledge, this is the first time that NMF with multiview hashing has been successfully applied to feature embedding for large-scale similarity search.


Fig. 1. Illustration of the workflow of MAH. During the training phase, the proposed optimization method is derived by an alternate optimization procedure which optimizes the kernel weights α and the factorization matrices (U, V) alternately. Using the multivariable logistic regression, the algorithm then outputs the projection matrix P and regression matrix Θ to generate the hash function, which are directly used in the test phase.

III. MULTIVIEW ALIGNMENT HASHING

In this section, we introduce our new Multiview Alignment Hashing approach, referred to as MAH. Our goal is to learn a hash embedding function which fuses the various alignment representations from multiple sources while preserving the high-dimensional joint distribution and obtaining the orthogonal bases simultaneously during the RKNMF. Originally, we need to find the binary solution which, however, is first relaxed to a real-valued range so that a more suitable solution can be gained. After applying the alternate optimization, we convert the real-valued solutions into binary codes. Fig. 1 shows the outline of the proposed MAH approach.

A. Objective Function

Given the i-th view training data X^(i) = [x_1^(i), . . . , x_N^(i)] ∈ R^{D_i×N}, we construct the corresponding N × N kernel matrix K_i using the Heat Kernel formulation:

K_i(x_p^(i), x_q^(i)) = exp(−‖x_p^(i) − x_q^(i)‖² / 2τ_i²), ∀p, q,

where τ_i is the related scalable parameter. Actually, our approach can work with any legitimate kernel. Without loss of generality, one of the most popular kernels, the Heat Kernel, is used here in the proposed method. Our discussion in further sections will only focus on the Heat Kernel.

Then multiple kernel matrices from each view's data, {K_1, . . . , K_n}, are computed and K_i ∈ R^{N×N}_{≥0}, ∀i. We further define the fusion matrix

K = Σ_{i=1}^{n} α_i K_i, subject to Σ_{i=1}^{n} α_i = 1, α_i ≥ 0, ∀i.
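As an illustration of the construction above, the following sketch (with hypothetical helper names) builds the Heat Kernel matrix for one view and the fused kernel K = Σ_i α_i K_i; choosing τ_i as the median pairwise distance follows the setting later described in Section IV-A.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def heat_kernel(X_view, tau=None):
    """Heat Kernel matrix of one view; X_view is (D_i, N), one sample per column."""
    dists = pdist(X_view.T)                    # pairwise Euclidean distances
    if tau is None:
        tau = np.median(dists)                 # median pairwise distance (Section IV-A)
    d2 = squareform(dists) ** 2                # (N, N) squared distances
    return np.exp(-d2 / (2.0 * tau ** 2))

def fuse_kernels(kernels, alpha):
    """K = sum_i alpha_i K_i with alpha constrained to the simplex."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / alpha.sum()                # enforce sum_i alpha_i = 1
    return sum(a * K for a, K in zip(alpha, kernels))
```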

To obtain a meaningful low-dimensional matrix factorization, we then set a constraint for the binary representation V = [v_1, . . . , v_N] as the similarity probability regularization, which is utilized to preserve the data distribution in the intrinsic objective space. The optimization is expressed as:

arg min_{V∈{0,1}^{d×N}} (1/2) Σ_{p,q} W_pq^(i) ‖v_p − v_q‖²,   (4)

where W_pq^(i) = Pr(x_p^(i), x_q^(i)) is the symmetric joint probability between x_p^(i) and x_q^(i) in the i-th view feature space. In this paper, we use a Gaussian function² [33] to measure it:

Pr(x_p^(i), x_q^(i)) = exp(−‖x_p^(i) − x_q^(i)‖²/2σ_i²) / Σ_{k≠l} exp(−‖x_k^(i) − x_l^(i)‖²/2σ_i²), if p ≠ q;   Pr(x_p^(i), x_q^(i)) = 0, if p = q,   (5)

where σ_i is the Gaussian smooth parameter and ‖x_p^(i) − x_q^(i)‖ is always measured by the Euclidean distance. Following the framework in [30], the similarity probability regularization for the i-th view can be reduced to

arg min_{V∈{0,1}^{d×N}} Tr(V L_i Vᵀ),   (6)

where L_i = D^(i) − W^(i) is the Laplacian matrix, W^(i) = (W_pq^(i)) ∈ R^{N×N} is the symmetric similarity matrix and D^(i) is a diagonal matrix with entries D_pp^(i) = Σ_q W_pq^(i).
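The regularizer of Eqs. (5)-(6) only needs the joint-probability matrix W^(i) and the Laplacian L_i = D^(i) − W^(i); below is a direct transcription in NumPy, assuming one sample per column.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def joint_probability_laplacian(X_view, sigma):
    """W^(i) of Eq. (5) and L_i = D^(i) - W^(i) of Eq. (6)."""
    d2 = squareform(pdist(X_view.T, metric="sqeuclidean"))
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)          # Pr(x_p, x_q) = 0 when p = q
    W /= W.sum()                      # denominator of Eq. (5): sum over all pairs k != l
    D = np.diag(W.sum(axis=1))        # degree matrix, D_pp = sum_q W_pq
    return W, D - W
```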

In addition, to reduce the redundancy of subspaces and simultaneously obtain compact bases in NMF, we hope the basis matrix U of NMF should be as orthogonal as possible to minimize the redundancy, i.e., UᵀU − I = 0. Here we also minimize ‖UᵀU − I‖² and make the basis U near-orthogonal, since this objective function can be fused into the computation of the L2-norm loss function of NMF. Finally, we combine the above-mentioned two constraints for U and V and construct our optimization problem as follows:

arg min_{U,V,α_i} ‖Σ_{i=1}^{n} α_i K_i − UV‖² + γ Σ_{i=1}^{n} α_i Tr(V L_i Vᵀ) + η‖UᵀU − I‖²,
s.t. V ∈ {0, 1}^{d×N}, U, V ≥ 0, Σ_{i=1}^{n} α_i = 1, α_i ≥ 0, ∀i,   (7)

where γ and η are two positive coefficients balancing the tradeoff between the NMF approximation error and the additional constraints.

²Our approach can work with any legitimate similarity probability measure, though we focus on the Gaussian similarity probability in this paper.

B. Alternate Optimization via Relaxation

In this section, we first relax the discreteness condition to a continuous one, i.e., we relax the binary domain V ∈ {0, 1}^{d×N} in Eq. (7) to the real number domain V ∈ R^{d×N} but keep the NMF requirements (nonnegativity conditions). To the best of our knowledge, there is no direct way to obtain the global solution of such a linearly constrained nonconvex optimization problem. Thus, in this paper the optimization procedure is implemented as an iterative procedure and divided into two steps by optimizing (U, V) and α = (α_1, . . . , α_n) alternately [34]. In each step, one of (U, V) and α is optimized while the other is fixed, and at the next step we switch them. The iteration procedure is repeated until it converges.

1) Optimizing (U, V): By first fixing α, we substitute K = Σ_{i=1}^{n} α_i K_i and L = Σ_{i=1}^{n} α_i L_i. A Lagrangian function is introduced for our problem as:

L_1(U, V, Φ, Ψ) = ‖K − UV‖² + γ Tr(V L Vᵀ) + η‖UᵀU − I‖² + Tr(Φ Uᵀ) + Tr(Ψ Vᵀ),   (8)

where Φ and Ψ are two matrices whose entries are the Lagrangian multipliers for the constraints U, V ≥ 0, respectively. Then we let the partial derivatives of L_1 with respect to U and V be zeroes, i.e., ∇_{U,V} L_1 = 0, and obtain

∇_U L_1 = 2(−K Vᵀ + U V Vᵀ + 2η U UᵀU − 2η U) + Φ = 0,   (9)
∇_V L_1 = 2(−Uᵀ K + UᵀU V + γ V L) + Ψ = 0.   (10)

Using the Karush-Kuhn-Tucker (KKT) conditions [35], since our objective function satisfies the constraint qualification condition, we have the complementary slackness conditions:

Φ_ij U_ij = 0 and Ψ_ij V_ij = 0, ∀i, j.

Multiplying by U_ij and V_ij on the corresponding entries of Eqs. (9) and (10), we have the following equations for U_ij and V_ij:

(−K Vᵀ + U V Vᵀ + 2η U UᵀU − 2η U)_ij U_ij = 0,
(−Uᵀ K + UᵀU V + γ V L)_ij V_ij = 0.

Hence, similar to the standard procedure of NMF [26], we get our updating rules:

U_ij ← [(K Vᵀ + 2η U)_ij / (U V Vᵀ + 2η U UᵀU)_ij] U_ij,   (11)
V_ij ← [(Uᵀ K + γ V W)_ij / (UᵀU V + γ V D)_ij] V_ij,   (12)

where D = Σ_{i=1}^{n} α_i D^(i) and W = Σ_{i=1}^{n} α_i W^(i). This allocation guarantees that all the entries of U and V are nonnegative. U also needs to be normalized as in Eq. (3). The convergence of U is based on [36] and the convergence of V is based on [30]. Similar to [37], it has been proven that after each update of U or V, the objective function is monotonically nonincreasing.

2) Optimizing α: Due to the quadratic norm in our objective function, the procedure for optimizing α is different from general linear programming. For fixed U and V, omitting the irrelevant terms, we define the Lagrangian function

L_2(α, λ, β) = ‖Σ_{i=1}^{n} α_i K_i − UV‖² + γ Σ_{i=1}^{n} α_i Tr(V L_i Vᵀ) + λ(Σ_{i=1}^{n} α_i − 1) + β αᵀ,   (13)

where λ and β = (β_1, . . . , β_n) are the Lagrangian multipliers for the constraints and α = (α_1, . . . , α_n). Considering the partial derivatives of L_2 with respect to α, λ and β, we have the equality ∇_{α,λ} L_2 = 0 and the inequality ∇_β L_2 ≥ 0. Then we need

∇_α L_2 = ( 2 Tr((Σ_{i=1}^{n} α_i K_i − UV) K_j) + γ Tr(V L_j Vᵀ) + λ + β_j )_{1≤j≤n} = 0,   (14)
∇_λ L_2 = Σ_{i=1}^{n} α_i − 1 = 0,   (15)
∇_β L_2 = α ≥ 0.   (16)

We also have the complementary slackness conditions

β_j α_j = 0, j = 1, . . . , n.   (17)

Suppose α_j = 0 for some j; specifically, denote J = {j | α_j = 0}. Then the optimal solution would include some zeroes, and there is no difference from the optimization procedure for minimizing ‖Σ_{j∉J} α_j K_j − UV‖². Without loss of generality, we suppose α_j > 0, ∀j. Then β = 0. From Eq. (14), we have

Σ_{i=1}^{n} α_i Tr(K_i K_j) = Tr(UV K_j) − γ Tr(V L_j Vᵀ)/2 − λ/2,   (18)


for j = 1, . . . , n. If we transform the above equations to matrix form and denote T_j = Tr(UV K_j) − γ Tr(V L_j Vᵀ)/2, we obtain

A αᵀ = B, where A_ji = Tr(K_i K_j) and B_j = T_j − λ/2, for i, j = 1, . . . , n.   (19)

Note that the matrix A is actually the Gram matrix of the K_i based on the Frobenius inner product ⟨K_i, K_j⟩ = Tr(K_i K_jᵀ) = Tr(K_i K_j). Let M = (vec(K_1), . . . , vec(K_n)), where vec(K_i) is the vectorization of K_i; then A = MᵀM, and rank(A) = rank(MᵀM) = rank(M) = n under the assumption that the kernel matrices K_1, . . . , K_n from the n different views are linearly independent. Combining Eq. (19) with Eq. (16) and eliminating λ by subtracting the last row of Eq. (19) from each of the other rows, while appending the constraint Σ_{i=1}^{n} α_i = 1 as the last row, we obtain the linear system

Ā αᵀ = B̄, where Ā_ji = Tr(K_i K_j) − Tr(K_i K_n) for j = 1, . . . , n−1, Ā_ni = 1, and B̄ = (T_1 − T_n, . . . , T_{n−1} − T_n, 1)ᵀ.   (20)

Owing to the variety of the different views, all of the rows in Ā are linearly independent. Then we have rank(Ā) = rank(A) − 1 + 1 = n. Thus, the inverse of Ā exists and we have

αᵀ = Ā⁻¹ B̄.
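The α-step therefore reduces to one linear solve. The following sketch (with a hypothetical function name) forms the Gram matrix A of Eq. (19), eliminates λ as in Eq. (20) and solves for α; it assumes, as argued above, that the kernel matrices are linearly independent and that the resulting weights stay positive.

```python
import numpy as np

def update_alpha(kernels, Ls, U, V, gamma):
    """Solve the linear system of Eq. (20) for the view weights alpha (U, V fixed)."""
    n = len(kernels)
    UV = U @ V
    # A_{ji} = Tr(K_i K_j): Gram matrix of the kernels under the Frobenius inner product.
    A = np.array([[np.trace(Ki @ Kj) for Ki in kernels] for Kj in kernels])
    T = np.array([np.trace(UV @ Kj) - 0.5 * gamma * np.trace(V @ Lj @ V.T)
                  for Kj, Lj in zip(kernels, Ls)])
    # Eliminate lambda (subtract the last row) and append the constraint sum_i alpha_i = 1.
    A_bar = np.vstack([A[:-1] - A[-1], np.ones(n)])
    b_bar = np.concatenate([T[:-1] - T[-1], [1.0]])
    return np.linalg.solve(A_bar, b_bar)
```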

3) Global Convergence: Let us denote our original objective function in Eq. (7) by L(U, V, α). Then, at any m-th step, the alternate iteration procedure is:

(U^(m), V^(m)) ← arg min_{U,V} L(U, V, α^(m−1))

and

α^(m) ← arg min_α L(U^(m), V^(m), α).

Thus we have the following chain of inequalities:

L(U^(m−1), V^(m−1), α^(m−1)) ≥ L(U^(m), V^(m), α^(m−1)) ≥ L(U^(m), V^(m), α^(m)) ≥ L(U^(m+1), V^(m+1), α^(m)) ≥ L(U^(m+1), V^(m+1), α^(m+1)) ≥ · · · .

In other words, L(U^(m), V^(m), α^(m)) is monotonically nonincreasing as m → ∞. Note that L(U, V, α) ≥ 0. Hence the alternate iteration converges.

Following the above optimization procedure on (U, V) and α, we compute them alternately, fixing one while updating the other, until the objective function converges. In practice, we terminate the iteration process when the difference |L(U^(m), V^(m), α^(m)) − L(U^(m−1), V^(m−1), α^(m−1))| is less than a small threshold or the number of iterations reaches a maximum value.
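Putting the two steps together, the overall alternate optimization can be sketched as the loop below, which stops when the change in the objective of Eq. (7) falls below a tolerance or a maximum number of rounds is reached. The update_UV and update_alpha routines are the hypothetical sketches given earlier in this section; keeping U and V across rounds (a warm start) is the speed-up trick later described in Section IV-A.

```python
import numpy as np

def objective(kernels, Ls, U, V, alpha, gamma, eta):
    """Relaxed objective of Eq. (7)."""
    K = sum(a * Ki for a, Ki in zip(alpha, kernels))
    L = sum(a * Li for a, Li in zip(alpha, Ls))
    d = U.shape[1]
    return (np.linalg.norm(K - U @ V) ** 2
            + gamma * np.trace(V @ L @ V.T)
            + eta * np.linalg.norm(U.T @ U - np.eye(d)) ** 2)

def mah_train(kernels, Ws, Ds, U, V, gamma, eta, max_rounds=10, tol=1e-4):
    """Alternate optimization of (U, V) and alpha until Eq. (7) stops decreasing.

    U, V : nonnegative initial factors (e.g., random); reused across rounds.
    """
    n = len(kernels)
    Ls = [D - W for W, D in zip(Ws, Ds)]
    alpha = np.full(n, 1.0 / n)                        # start from uniform view weights
    prev = np.inf
    for _ in range(max_rounds):
        K = sum(a * Ki for a, Ki in zip(alpha, kernels))
        W = sum(a * Wi for a, Wi in zip(alpha, Ws))
        D = sum(a * Di for a, Di in zip(alpha, Ds))
        U, V = update_UV(K, W, D, U, V, gamma, eta)    # warm start: reuse previous U, V
        alpha = update_alpha(kernels, Ls, U, V, gamma)
        cur = objective(kernels, Ls, U, V, alpha, gamma, eta)
        if abs(prev - cur) < tol:                      # convergence test on the objective
            break
        prev = cur
    return U, V, alpha
```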

C. Hashing Function Generation

From the above section, we can compute the weight vector α = (α_1, . . . , α_n) and further obtain the fused kernel matrix K and the combined joint probability Laplacian matrix L. Thus, from Eq. (11) and Eq. (12) we can obtain the multiview RKNMF bases U ∈ R^{N×d} and the low-dimensional representation V ∈ R^{d×N}, where d ≪ D_i, i = 1, . . . , n. We now convert the above low-dimensional real-valued representation V = [v_1, . . . , v_N] into binary codes via thresholding: if the l-th element of v_p is larger than the specified threshold, the mapped bit v_p^(l) = 1; otherwise v_p^(l) = 0, where p = 1, . . . , N and l = 1, . . . , d.

According to [38], a well-designed semantic hashing should also be entropy-maximized to ensure efficiency. Moreover, from information theory [39], the maximal entropy of a source alphabet is attained by having a uniform probability distribution. If the entropy of codes over the bit-string is small, it means that documents are mapped to only a small part of codes (hash bins), thereby rendering the hash table inefficient. In our method, we set the threshold for the l-th element of v_p as the median value of {v_p^(l), p = 1, . . . , N}, which meets the entropy-maximizing criterion mentioned above. Therefore, each bit will be 1 for half of the training samples and 0 for the other half. The above scheme gives each distinct binary code roughly equal probability of occurring in the data collection, hence achieving the best utilization of the hash table. The binary code can be represented as V = [v_1, . . . , v_N], where v_p ∈ {0, 1}^d and p = 1, . . . , N. A similar scheme has been used in [40] and [41].
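In code, this entropy-maximizing binarization is simply a per-bit median threshold over the N training samples; a minimal sketch:

```python
import numpy as np

def binarize_by_median(V):
    """Threshold each row (bit) of the d x N matrix V at its median over the N samples."""
    thresholds = np.median(V, axis=1, keepdims=True)   # one threshold per bit
    return (V > thresholds).astype(np.uint8), thresholds
```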

Up to now, we have only described how to compute the hash codes of items in the training set, while for a new coming sample we cannot yet explicitly find the corresponding hashing function. Inspired by [42], we determine our out-of-sample extension using a regression technique with multiple variables. In this paper, instead of applying linear regression as in [42], the binary coding environment forces us to implement our task via logistic regression [43], which can be treated as a type of probabilistic statistical classification model. The corresponding probabilities describe the possibilities of binary responses. For n distributions Y_i | X_i ∼ Bernoulli(p_i), i = 1, . . . , n, and the function Pr(Y_i = 1 | X_i = x) := h_θ(x) with parameter θ, the likelihood function is

Π_{i=1}^{n} Pr(Y_i = y_i | X_i = x_i) = Π_{i=1}^{n} h_θ(x_i)^{y_i} (1 − h_θ(x_i))^{1−y_i}.

Fig. 2. The illustration of the embedding via Eq. (23), i.e., the nearest integer function.

According to the maximum log-likelihood criterion, we define our logistic regression cost function:

J(Θ) = −(1/N) ( Σ_{p=1}^{N} ( ⟨v_p, log(h_Θ(v_p))⟩ + ⟨(1 − v_p), log(1 − h_Θ(v_p))⟩ ) + ξ‖Θ‖² ),   (21)

where

h_Θ(v_p) = ( 1 / (1 + e^{−(Θᵀ v_p)_i}) )ᵀ_{1≤i≤d}

is the regression function for each component of v_p; log(x) = (log(x_1), . . . , log(x_n))ᵀ for x = (x_1, . . . , x_n)ᵀ ∈ R^n; ⟨·, ·⟩ represents the inner product; Θ is the corresponding regression matrix of size d × d; 1 denotes the all-ones vector; and ξ is the parameter of the regularization term ξ‖Θ‖² in logistic regression, which is used to avoid overfitting.

To minimize J(Θ), a standard gradient descent algorithm is employed. According to [43], the update equation with learning rate r can be written as:

Θ_{t+1} = Θ_t − r ( (1/N) Σ_{p=1}^{N} (h_Θ(v_p) − v_p) v_pᵀ + (ξ/N) Θ_t ).   (22)

We stop the update iteration when the norm of the difference between Θ_{t+1} and Θ_t, i.e., ‖Θ_{t+1} − Θ_t‖², is less than an empirical small value (i.e., we reach convergence), and then output the regression matrix Θ.

In this way, given a new sample, the related kernel matrices {K_1^new, . . . , K_n^new} for each view are first computed via the Heat Kernel, where K_i^new is an N × 1 matrix, ∀i. We then fuse these kernels with the optimized weights α:

K^new = Σ_{i=1}^{n} α_i K_i^new,

and obtain the low-dimensional real-valued representation by the linear projection matrix

P = (UᵀU)⁻¹Uᵀ

obtained via RKNMF, which is the pseudoinverse of U. Since h_Θ is a sigmoid function, the hash code for the new sample can finally be calculated as:

v^new = ⌊h_Θ(P · K^new)⌉,   (23)

where ⌊·⌉ is the nearest integer function applied to each entry of h_Θ. In fact, it exploits the property h_Θ ∈ (0, 1) to binarize v^new by a threshold of 0.5: if any bit of the output of h_Θ(P · K^new) is larger than 0.5, we assign "1" to this bit, otherwise "0". In this way, we can obtain the final Multiview Alignment Hashing (MAH) codes for any data points (see Fig. 2).

Algorithm 1 Multiview Alignment Hashing (MAH)

It is noteworthy that our approach can be regarded as an embedding method. In practice, we map all the training and test data in the same way to ensure that they lie in the same subspace, by applying Eq. (23) with {P, α, Θ}, which are obtained via the multiview RKNMF optimization and logistic regression. This procedure is similar to a handcrafted embedding scheme, without re-training. The corresponding integrated MAH algorithm is depicted in Algorithm 1.

D. Complexity Analysis

The cost of MAH learning mainly contains two parts. The first part is the construction of the Heat Kernels and the similarity probability regularization terms for the different views, i.e., K_i and L_i. From Section III-A, the time complexity of this part is O(2(Σ_{i=1}^{n} D_i)N²). The second part is the alternate optimization. The time complexity of the matrix factorization in the (U, V) updating step is O(N²d) according to [30], and the updating of α has complexity O(n²N²) in MAH. In total, the time complexity of MAH learning is O(2(Σ_{i=1}^{n} D_i)N² + T × (N²d + n²N²)), where T is the number of iterations of the alternate optimization. Empirically, T is always less than 10, i.e., MAH converges within 10 rounds.

IV. EXPERIMENTS AND RESULTS

In this section, the MAH algorithm is evaluated for the high-dimensional nearest neighbor search problem. Three different datasets are used in our experiments, i.e., Caltech-256 [44], CIFAR-10 [45] and CIFAR-20 [45]. Caltech-256 consists of 30607 images associated with 256 object categories. CIFAR-10 and CIFAR-20 are both 60,000-image subsets collected from the 80-million tiny images dataset [46] with 10 class labels and 20 class labels, respectively. Following the experimental setting in [21] and [22], for each dataset we randomly select 1000 images as the query set and the rest of the dataset is used as the training set. Given an image, we would like to describe it with multiview features extracted from it. The descriptors are expected to capture the orientation, intensity, texture and color information, which are the main cues of an image. Therefore, 512-dim Gist³ [47], 1152-dim histograms of oriented gradients (HOG)⁴ [48], 256-dim local binary patterns (LBP)⁵ [49] and 192-dim color histograms (ColorHist)⁶ are respectively employed for image representation.

In the test phase, a returned point is regarded as a true neighbor if it lies among the top 100, 500 and 500 points closest to a query for Caltech-256, CIFAR-10 and CIFAR-20, respectively. For each query, all the data points in the database are ranked according to their Hamming distances to the query, since this is fast enough with short hash codes in practice. We then evaluate the retrieval results by the Mean Average Precision (MAP) and the precision-recall curve. Additionally, we also report the training time and the test time (the average search time used for each query) for all the methods. All experiments are performed using Matlab 2013a on a server configured with a 12-core processor and 128 GB of RAM running the Linux OS.
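For reference, the evaluation protocol can be sketched as follows (an illustrative re-implementation, not the authors' evaluation code): rank the database by Hamming distance to the query code and compute the average precision per query; MAP is the mean of this value over all queries.

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query (codes are 0/1 arrays)."""
    dists = np.count_nonzero(db_codes != query_code[None, :], axis=1)
    return np.argsort(dists, kind="stable")

def average_precision(ranked_relevance):
    """Average precision of one query, given a 0/1 relevance vector in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())
```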

A. Compared Methods and Settings

We compare our method against six popular unsupervised multiview hashing algorithms, i.e., Multi-View Anchor Graph Hashing (MVAGH) [21], Sequential Update for Multi-View Spectral Hashing (SU-MVSH) [22], Multi-View Hashing (MVH-CS) [23], Composite Hashing with Multiple Information Sources (CHMIS) [24], Deep Multi-view Hashing (DMVH) [25] with a 4-layer deep-net, and a derived version of MVH-CS, termed MVH-CCA, which is a special case of MVH-CS where the averaged similarity matrix is fixed as the identity matrix [23]. Besides, we also compare our method with two state-of-the-art single-view hashing methods, i.e., Spectral Hashing (SpH) [8] and Anchor Graphs Hashing (AGH) [10]. For single-view hashing methods, data from multiple views are concatenated into a single representation for hashing learning. It is noteworthy that we have normalized all the features to the same scale before concatenating the multiple features into a single representation for the single-view hashing algorithms. All of the above methods are then evaluated on six different lengths of codes (16, 32, 48, 64, 80, 96). Following the same experimental setting, all the parameters used in the compared methods have been strictly chosen according to their original papers.

³Gabor filters are applied on images with 8 different orientations and 4 scales. Each filtered image is then averaged over a 4 × 4 grid, leading to a 512-dimensional vector (8 × 4 × 16 = 512).
⁴36 4 × 4 non-overlapping windows yield a 1152-dimensional vector.
⁵The LBP labels the pixels of an image by thresholding a 3 × 3 neighborhood, and the responses are mapped to a 256-dimensional vector.
⁶For each R, G, B channel, a 64-bin histogram is computed, and the total length is 3 × 64 = 192.

For our MAH, we apply the Heat Kernel

K_i(x_p^(i), x_q^(i)) = exp(−‖x_p^(i) − x_q^(i)‖² / 2τ_i²), ∀p, q,

to construct the original kernel matrices, where τ_i is set as the median of the pairwise distances of data points. The selection of the smooth parameter σ_i used in Eq. (5) follows the same scheme as that of τ_i. The optimal learning rate r for each dataset is selected from {0.01, 0.02, . . . , 0.10} with a step of 0.01, as the value which yields the best performance by 10-fold cross-validation on the training data. The choice of the three regularization parameters {γ, η, ξ} is also made via cross-validation on the training set, and we finally fix γ = 0.15, η = 0.325 and ξ = 0.05 for all three datasets. To further speed up the convergence of the proposed alternate optimization procedure, in our experiments we apply a small trick with the following steps:

1) The first time U and V are calculated in the (U, V) optimization step of Section III-B, we utilize Eq. (11) and Eq. (12) with random initialization, following the original NMF procedure [26]. We then store the obtained U and V.

2) From the second time onwards, we optimize (U, V) by using the stored U and V from the previous round to initialize the NMF algorithm, instead of using random values.

This small improvement can effectively reduce the convergence time in the training phase and makes the proposed method more efficient for large-scale tasks. Fig. 3 shows the retrieval performance of MAH when the four descriptors (i.e., Gist, HOG, LBP, ColorHist) are used together via different kernel combinations, or when single descriptors are used. Specifically, MAH denotes that the kernels are combined by the proposed method:

K = Σ_{i=1}^{n} α_i K_i,

where the weights α_i are obtained via the alternate optimization. MAH (Average) means that the kernels are combined by the arithmetic mean

K = (1/n) Σ_{i=1}^{n} K_i,

and MAH (Product) indicates the combination of the kernels through the geometric mean

K = (Π_{i=1}^{n} K_i)^{1/n}.

Fig. 3. Performance (MAP) of MAH when four descriptors are used together via different kernel combinations in comparison to that of single descriptors.

Fig. 4. Performance (MAP) comparison with different numbers of bits.

Fig. 5. The precision-recall curves of all compared algorithms on the three datasets with the code length of 96 bits.

The corresponding combinations of the similarity probability regularization terms L_i also follow the above schemes. The results on the three datasets demonstrate that integrating multiple features achieves better performance than using single features, and that the proposed weighted combination improves the performance compared with the average and product schemes. Fig. 4 illustrates the MAP curves of all compared algorithms on the three datasets. Overall, firstly, the retrieval accuracies on the CIFAR-10 dataset are obviously higher than those on the more complicated CIFAR-20 and Caltech-256 datasets. Secondly, the multiview methods always achieve better results than single-view schemes. Particularly, in most cases, MVAGH achieves higher performance than SU-MVSH, MVH-CS, MVH-CCA, DMVH-4 layers and CHMIS. The results of MVH-CS always climb up and then go down as the length of codes increases. The same tendency also appears with CHMIS. In general, our MAH outperforms all other compared methods (see also Table I). In addition, we have also compared our proposed method with various other baselines (with/without the probability regularization or the orthogonal constraint) in Table II. It is clearly observed that the proposed method significantly improves the effectiveness of NMF and its variants in terms of accuracy.

Fig. 6. Some retrieval results on the Caltech-256 dataset. The top-left image in each block is the query and the rest are the 8 nearest images. The incorrect results are marked by red boxes.

Fig. 7. Comparison of some retrieval results via the compared hashing methods on the Caltech-256 dataset. The top-left image in each block is the query and the rest are the 8 nearest images. The incorrect results are marked by red boxes.

Furthermore, we present the precision-recall curves of all the algorithms on the three datasets with the code length of 96 bits in Fig. 5. From this figure, we can further see that MAH again consistently achieves the best results in terms of the Area Under the Curve (AUC). Some examples of retrieval results on Caltech-256 are also shown in Fig. 6 and Fig. 7.

The proposed algorithm aims to generate non-linear binary codes, which can better fit the intrinsic data structure than linear ones. The only existing non-linear multiview hashing algorithm is MVAGH. However, the proposed algorithm can successfully find the optimal aggregation of the kernel matrix, whereas MVAGH cannot. Therefore, it is expected that the proposed algorithm performs better than MVAGH and the existing linear multiview hashing algorithms.

TABLE I. MEAN AVERAGE PRECISION (MAP) OF 32 BITS WITH TRAINING AND TEST TIME ON THREE DATASETS

TABLE II. COMPARISON OF DIFFERENT VARIANTS OF MAH TO PROVE THE EFFECTIVENESS OF THE IMPROVEMENT

Finally, we list the training time and the test time of the different algorithms on the three datasets in Table I. Considering the training time, since DMVH involves learning a 4-layer deep-net, it spends the most time to train hashing functions. MVH-CCA spends the second longest time for training. Our MAH only costs slightly more time than MVAGH and CHMIS but significantly less than the other methods for training. For the test phase, MVH-CS and MVH-CCA are the most efficient methods, while MAH has searching time competitive with SU-MVSH. DMVH needs the most time for testing as well.

V. CONCLUSION

In this paper, we have presented a novel unsupervised hashing method called Multiview Alignment Hashing (MAH), where hashing functions are effectively learnt via kernelized Nonnegative Matrix Factorization while preserving the data joint probability distribution. We incorporate multiple visual features from different views together, and an alternate optimization is introduced to learn the weights for the different views and simultaneously produce the low-dimensional representation. We address this as a nonconvex optimization problem whose alternate procedure finally converges to a locally optimal solution. For the out-of-sample extension, multivariable logistic regression has been successfully applied to obtain the regression matrix for fast hash encoding. Numerical experiments have been systematically carried out on the Caltech-256, CIFAR-10 and CIFAR-20 datasets. The results show that our MAH significantly outperforms the state-of-the-art multiview hashing techniques in terms of searching accuracy.

REFERENCES

[1] L. Shao, L. Liu, and X. Li, "Feature learning for image classification via multiobjective genetic programming," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 7, pp. 1359–1371, Jul. 2014.
[2] F. Zhu and L. Shao, "Weakly-supervised cross-domain dictionary learning for visual recognition," Int. J. Comput. Vis., vol. 109, nos. 1–2, pp. 42–59, Aug. 2014.
[3] J. Han et al., "Representing and retrieving video shots in human-centric brain imaging space," IEEE Trans. Image Process., vol. 22, no. 7, pp. 2723–2736, Jul. 2013.
[4] J. Han, K. N. Ngan, M. Li, and H.-J. Zhang, "A memory learning framework for effective image retrieval," IEEE Trans. Image Process., vol. 14, no. 4, pp. 511–524, Apr. 2005.
[5] J. Han, S. He, X. Qian, D. Wang, L. Guo, and T. Liu, "An object-oriented visual saliency detection framework based on sparse coding representations," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 12, pp. 2009–2021, Dec. 2013.
[6] V. Gaede and O. Günther, "Multidimensional access methods," ACM Comput. Surv., vol. 30, no. 2, pp. 170–231, Jun. 1998.
[7] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in Proc. Int. Conf. Very Large Data Bases, vol. 99, 1999, pp. 518–529.
[8] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Proc. Neural Inf. Process. Syst., 2008, pp. 1753–1760.
[9] J. Wang, S. Kumar, and S.-F. Chang, "Semi-supervised hashing for large-scale search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, pp. 2393–2406, Dec. 2012.
[10] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, "Hashing with graphs," in Proc. Int. Conf. Mach. Learn., 2011, pp. 1–8.
[11] K. Grauman and R. Fergus, "Learning binary hash codes for large-scale image search," in Machine Learning for Computer Vision. Berlin, Germany: Springer-Verlag, 2013, pp. 49–87.
[12] J. Cheng, C. Leng, P. Li, M. Wang, and H. Lu, "Semi-supervised multi-graph hashing for scalable similarity search," Comput. Vis. Image Understand., vol. 124, pp. 12–21, Jul. 2014.
[13] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, "Spectral hashing with semantically consistent graph for image indexing," IEEE Trans. Multimedia, vol. 15, no. 1, pp. 141–152, Jan. 2013.
[14] C. Xu, D. Tao, and C. Xu, "Large-margin multi-view information bottleneck," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1559–1572, Aug. 2014.
[15] T. Xia, D. Tao, T. Mei, and Y. Zhang, "Multiview spectral embedding," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 6, pp. 1438–1446, Dec. 2010.
[16] C. Xu, D. Tao, and C. Xu. (2013). "A survey on multi-view learning." [Online]. Available: http://arxiv.org/abs/1304.5634
[17] B. Xie, Y. Mu, D. Tao, and K. Huang, "m-SNE: Multiview stochastic neighbor embedding," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 4, pp. 1088–1096, Aug. 2011.


[18] M. Wang, X.-S. Hua, R. Hong, J. Tang, G.-J. Qi, and Y. Song, "Unified video annotation via multigraph learning," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 5, pp. 733–746, May 2009.
[19] M. Wang, H. Li, D. Tao, K. Lu, and X. Wu, "Multimodal graph-based reranking for web image search," IEEE Trans. Image Process., vol. 21, no. 11, pp. 4649–4661, Nov. 2012.
[20] J. Yu, M. Wang, and D. Tao, "Semisupervised multiview distance metric learning for cartoon synthesis," IEEE Trans. Image Process., vol. 21, no. 11, pp. 4636–4648, Nov. 2012.
[21] S. Kim and S. Choi, "Multi-view anchor graph hashing," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2013, pp. 3123–3127.
[22] S. Kim, Y. Kang, and S. Choi, "Sequential spectral learning to hash with multiple representations," in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 538–551.
[23] S. Kumar and R. Udupa, "Learning hash functions for cross-view similarity search," in Proc. Int. Joint Conf. Artif. Intell., 2011, p. 1360.
[24] D. Zhang, F. Wang, and L. Si, "Composite hashing with multiple information sources," in Proc. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2011, pp. 225–234.
[25] Y. Kang, S. Kim, and S. Choi, "Deep learning to hash with multiple representations," in Proc. IEEE 12th Int. Conf. Data Mining, Dec. 2012, pp. 930–935.
[26] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, Oct. 1999.
[27] S. Z. Li, X. Hou, H. Zhang, and Q. Cheng, "Learning spatially localized, parts-based representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Dec. 2001, pp. I-207–I-212.
[28] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," J. Mach. Learn. Res., vol. 5, pp. 1457–1469, Dec. 2004.
[29] Q. Gu and J. Zhou, "Neighborhood preserving nonnegative matrix factorization," in Proc. Brit. Mach. Vis. Conf., 2009, pp. 1–10.
[30] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[31] D. Zhang, Z.-H. Zhou, and S. Chen, "Non-negative matrix factorization on kernels," in Proc. 9th Pacific Rim Int. Conf. Artif. Intell., 2006, pp. 404–412.
[32] S. An, J.-M. Yun, and S. Choi, "Multiple kernel nonnegative matrix factorization," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2011, pp. 1976–1979.
[33] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res., vol. 9, no. 11, pp. 2579–2605, Nov. 2008.
[34] J. C. Bezdek and R. J. Hathaway, "Some notes on alternating optimization," in Proc. Adv. Soft Comput. (AFSS), 2002, pp. 288–300.
[35] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[36] W. Zheng, Y. Qian, and H. Tang, "Dimensionality reduction with category information fusion and non-negative matrix factorization for text categorization," in Proc. 3rd Int. Conf. Artif. Intell. Comput. Intell., 2011, pp. 505–512.
[37] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. Neural Inf. Process. Syst., 2000, pp. 556–562.
[38] S. Baluja and M. Covell, "Learning to hash: Forgiving hash functions and applications," Data Mining Knowl. Discovery, vol. 17, no. 3, pp. 402–430, Dec. 2008.
[39] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mobile Comput. Commun. Rev., vol. 5, no. 1, pp. 3–55, Jan. 2001.
[40] D. Zhang, J. Wang, D. Cai, and J. Lu, "Self-taught hashing for fast similarity search," in Proc. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2010, pp. 18–25.
[41] Y. Lin, R. Jin, D. Cai, S. Yan, and X. Li, "Compressed hashing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 446–451.
[42] D. Cai, X. He, and J. Han, "Spectral regression for efficient regularized subspace learning," in Proc. IEEE 11th Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8.
[43] D. W. Hosmer, Jr., and S. Lemeshow, Applied Logistic Regression. New York, NY, USA: Wiley, 2004.
[44] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," California Inst. Technol., Pasadena, CA, USA, Tech. Rep. CaltechAUTHORS:CNS-TR-2007-001, 2007.
[45] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, Tech. Rep. TR-2009, 2009.
[46] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, Nov. 2008.
[47] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, May 2001.
[48] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 886–893.
[49] T. Ahonen, A. Hadid, and M. Pietikäinen, "Face recognition with local binary patterns," in Proc. 8th Eur. Conf. Comput. Vis., 2004, pp. 469–481.

Li Liu received the B.Eng. degree in electronic information engineering from Xi'an Jiaotong University, Xi'an, China, in 2011, and the Ph.D. degree from the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, U.K., in 2014. He is currently a Research Fellow with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, U.K. His research interests include computer vision, machine learning, and data mining.

Mengyang Yu (S'14) received the B.S. and M.S. degrees from the School of Mathematical Sciences, Peking University, Beijing, China, in 2010 and 2013, respectively. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, U.K. His research interests include computer vision, machine learning, and data mining.

Ling Shao (M'09–SM'10) is currently a Professor with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, U.K. He was a Senior Lecturer with the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, U.K., from 2009 to 2014, and a Senior Scientist with Philips Research, Eindhoven, The Netherlands, from 2005 to 2009. His research interests include computer vision, image/video processing, and machine learning. He is an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON CYBERNETICS, and several other journals. He is a fellow of the British Computer Society and the Institution of Engineering and Technology.