
Lost in Binarization: Query-Adaptive Ranking for Similar Image Search with Compact Codes

Yu-Gang Jiang§, Jun Wang†, Shih-Fu Chang§

§Columbia University, New York, NY 10027, USA
†IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA

{yjiang, sfchang}@ee.columbia.edu [email protected]

ABSTRACT

With the proliferation of images on the Web, fast search of visually similar images has attracted significant attention. State-of-the-art techniques often embed high-dimensional visual features into a low-dimensional Hamming space, where search can be performed in real-time based on the Hamming distance of compact binary codes. Unlike traditional metrics (e.g., Euclidean) on raw image features that produce continuous distances, Hamming distances are discrete integer values. In practice, there are often a large number of images sharing equal Hamming distance to a query, which is a critical issue for image search where ranking is very important. In this paper, we propose a novel approach that facilitates query-adaptive ranking for images with equal Hamming distance. We achieve this goal by first learning, offline, bit-level weights of the binary codes for a diverse set of predefined semantic concept classes. The weight learning process is formulated as a quadratic programming problem that minimizes intra-class distance while preserving inter-class relationships in the original raw image feature space. Query-adaptive weights are then rapidly computed by evaluating the proximity between a query and the concept categories. With the adaptive bit weights, the returned images can be ordered by weighted Hamming distance at a finer-grained binary code level rather than at the original integer Hamming distance level. Experimental results on a Flickr image dataset show clear improvements from our query-adaptive ranking approach.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process

General Terms

Algorithm, Experimentation, Performance.

Keywords

Query-adaptive ranking, weighted Hamming distance, binary code, image search.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICMR '11, April 17-20, Trento, Italy.
Copyright © 2011 ACM 978-1-4503-0336-1/11/04 ...$10.00.

1. INTRODUCTION

The capability of searching similar images in large databases has great potential in many real-world applications. While traditional image search engines rely heavily on textual words associated with the images, scalable content-based search techniques are receiving increasing attention and have recently appeared in some search engines such as Google and Bing. Apart from providing a better image search experience for ordinary users on the Web, scalable similar image search has also been shown to be helpful for solving traditionally very hard problems in computer vision (e.g., image segmentation and categorization) [28].

Typically, a scalable similar image search system has two major components: i) an effective image feature representation, and ii) an efficient data structure. It is well known that the quality of image search results largely relies on the representation power of the image features. The latter, an efficient data structure, is necessary since existing image features are mostly of high dimension, and exhaustively comparing a query with many database samples on top of such features is computationally very slow.

In this work, we represent images using the popular bag-of-visual-words (BoW) framework [25], in which local invariant image descriptors (e.g., SIFT [18]) are extracted and quantized into a set of visual words. The BoW features are then embedded into compact binary codes for efficient search. For this, we consider state-of-the-art binary embedding techniques including hashing [30] and deep learning [8]. Binary embedding is preferable to tree-based indexing structures (e.g., the kd-tree [1]) as it generally requires much less memory and also works better in high dimensions. With the binary codes, image similarity can be measured in Hamming space by the Hamming distance – an integer value obtained by counting the number of bits with different values. For large scale applications, the dimension of the Hamming space is usually set to a small number (e.g., less than a hundred), limited by the memory budget [29, 34].
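Because a code of this size fits in a machine word, computing the Hamming distance reduces to an XOR followed by a popcount. The snippet below is a minimal illustration of that bitwise computation (not from the paper):

```python
import random

d = 48
a = random.getrandbits(d)        # a 48-bit binary code packed into a single integer
b = random.getrandbits(d)

# XOR marks the bits where the two codes differ; popcount counts them.
hamming = bin(a ^ b).count("1")
print(hamming)
```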

Despite the success of using the Hamming distance of binary codes in scalable similar image search, it is important to realize that it is limited in providing the good ranking that is crucial for image search – there can be $C_d^i$ different binary codes sharing equal distance $i$ ($i > 0$) to a query in a $d$-dimensional Hamming space. For example, there are 1,128 different binary codes with Hamming distance 2 to a query when using 48-d compact binary codes. As a result, hundreds or even thousands of images may share an equal Hamming distance in practice, yet they are very unlikely to be equally relevant to the query.
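As a quick check of that count (a worked instance of the binomial expression above, not additional material from the paper):

$$C_{48}^{2} = \binom{48}{2} = \frac{48 \times 47}{2} = 1128.$$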


Figure 1: Search results for a sunset scene query from a Flickr image dataset, using 48-bit binary codes generated by deep learning. The top and bottom rows respectively show the most similar images based on query-independent Hamming distance (Baseline) and the proposed query-adaptive Hamming distance (Ours). It can be clearly seen that our method returns more relevant images. Note that our method does not permute images with exactly the same code as the query (three in total for this query), i.e., Hamming distance = 0; see the text for more explanation. This figure is best viewed in color.

This paper proposes a novel approach to compute query-adaptive weights for each bit of the binary codes, which has two main advantages. First, with the bit-level weights, we are able to rank the returned images at a finer-grained binary code level¹, rather than at the traditional Hamming distance level. In other words, we can push the resolution of ranking from d (Hamming distance level) up to 2^d (binary code level). Second, contrary to using a single set of weights for all queries, our approach tailors a different and more suitable set of weights for each query.

¹Some existing approaches rerank top results at the finest-grained image level based on visual similarity of the original raw features, which is not scalable since the raw features cannot fit in memory and disk-read operations for loading the features are slow.

To compute the query-adaptive weights in real-time, we harness a set of semantic concept classes that cover most semantic elements of image content (e.g., scenes and objects). Bit-level weights for each of the classes are learned offline using a novel formulation that not only maximizes intra-class sample similarity but also preserves inter-class relationships. We show that the optimal weights can be computed by iteratively solving quadratic programming problems. These precomputed class-specific weights are then utilized for online computation of the query-adaptive weights, through rapidly evaluating the proximity of a query image to the samples from the semantic classes. With the query-adaptive weights, weighted Hamming distance is finally applied to evaluate similarities between the query and images in a target database. We name our proposed approach query-adaptive Hamming distance, as opposed to the query-independent Hamming distance widely used in existing works. Notice that during online search it is unnecessary to compute the weighted Hamming distance based on real-valued vectors (weights imposed on the binary codes), which would bury one of the most important advantages of binary embedding. Instead, the weights can be used as indicators to efficiently order the returned images at the binary code level.

Figure 1 displays search results for a query from a dataset of Flickr images. As can be seen, our approach obtains a clearly better result (bottom row) by reordering images with Hamming distance larger than 0. It is worth noting that in most cases there is typically a very small number, if not zero, of images having Hamming distance 0 to the queries (see Figure 3), as there is only one binary code satisfying this condition (in contrast to $C_d^i$, $i > 0$). Our main contributions are summarized as follows:

1. We propose a complete framework to learn an adaptive Hamming distance for each query in real-time, in order to provide better ranking for similar image search with compact binary codes. This is achieved by novelly exploring a set of predefined semantic classes. To the best of our knowledge, this is the first work on query-adaptive ranking for image search in Hamming space.

2. To learn suitable weights for each of the semantic classes, we propose a formulation that not only maximizes intra-class sample similarity but also maintains inter-class proximity, and which can be efficiently solved by quadratic programming.

The remaining sections are organized as follows. We review related work in Section 2 and give an overview of our approach in Section 3. Section 4 introduces two binary embedding methods, and Section 5 elaborates our formulation for learning the query-adaptive Hamming distance. We conduct experimental validation in Section 6 and conclude in Section 7.

2. RELATED WORK

Searching visually similar images has been a longstanding research issue, dating back at least to the early 1990s [26]. See the surveys by Smeulders et al. [26] and Datta et al. [4] for related work in the past decade. Many works adopted simple features such as color and texture [27, 22], while more effective features such as GIST [21] and SIFT [18] have become popular in recent years [25, 20]. In this paper, we choose the popular bag-of-visual-words (BoW) framework grounded on local SIFT features, whose effectiveness has been validated in numerous applications. Since our work is more related to fast search, in the rest of this section we mainly review existing works on efficient data structures, which are roughly divided into three categories: 1) inverted index, 2) tree-based index, and 3) binary embedding.

Indexing data with an inverted file was initially proposed for fast textual document retrieval in the IR community [37] and is still very popular there. It has been introduced to the field of image retrieval because recent image feature representations such as BoW are very analogous to the bag-of-words representation of textual documents. In this structure, a list of references to each document (image) is created for each text (visual) word, so that relevant documents (images) can be quickly located given a query with several words. A key difference of document retrieval from visual search, however, is that textual queries usually contain very few words. For instance, on average there are 4 words per query in Google web search. In the BoW representation, by contrast, a single image may contain hundreds of words, resulting in a large number of candidate images (from the inverted lists) that need further verification – usually based on similarities of the original BoW features. This largely limits the application of inverted files for large scale image search. While increasing the visual vocabulary size in BoW can reduce the number of candidates, it will also significantly increase memory usage [12]. For example, indexing 1 million BoW features of 10,000 dimensions will need 1GB of memory with a compressed version of the inverted file. In contrast, for the binary embedding discussed later, the memory consumption is much lower (48MB for 1 million 48-bit binary codes).

Tree-based indexing techniques [1, 17, 19, 20] have been frequently applied to fast visual search. Nister and Stewenius [20] used a visual vocabulary tree to perform real-time object retrieval in 40,000 images. Muja and Lowe [19] used multiple randomized kd-trees for SIFT feature matching. One drawback of the tree-based methods is that they are not suitable for high-dimensional features. For example, let the dimensionality be d and the number of samples be n; one general rule is that n >> 2^d is required for a kd-tree to work more efficiently than exhaustive search [10].

In view of the limitations of both the inverted file and the tree-based methods, embedding high-dimensional image features into compact binary codes has attracted much more attention. Binary embedding satisfies both the query-speed (via hash table lookup or efficient bitwise operations) and the memory requirements. Generally there are two ways to compute the binary codes: hashing and deep learning. The former uses a group of projections to hash an input space into multiple buckets such that similar images are likely to be mapped into the same bucket. Most of the existing hashing techniques are unsupervised. Among them, one of the most well-known methods is Locality Sensitive Hashing (LSH) [11]. Recently, Kulis and Grauman [16] extended LSH from the input feature space to arbitrary kernel spaces, and Chum et al. [3] proposed min-Hashing to extend LSH to sets of features. Since these LSH-based methods use random projections, many more bits (random projections) are needed for satisfactory performance when the dimension of the input space is high. In light of this, Weiss et al. [34] proposed a spectral hashing (SH) method that hashes the input space based on the data distribution, ensuring that projections are orthogonal and that sample numbers are balanced across different buckets. To further utilize image label information, several (semi-)supervised methods have been proposed to learn good hash functions [15, 30].

Binary embedding by learning deep belief networks was proposed by Hinton and Salakhutdinov [8]. While it has occasionally been viewed as a type of hashing method [23], we discuss it separately since its mechanism for generating the binary codes is very different from traditional hashing techniques. Similar to the supervised hashing methods [15, 30], the deep network also requires image labels during training in order to learn a good mapping. In the network, multiple Restricted Boltzmann Machines (RBMs) are stacked and trained to gradually map image features at the bottom layer to binary codes at the top (deepest) layer. Several recent works have successfully applied deep learning to scalable image search [29, 9].

All these binary embedding methods, either unsupervised or supervised, share one limitation when applied to image search. As discussed in the introduction, the Hamming distance of binary codes cannot provide good ranking, which is very important in practice. This paper introduces a means to learn query-adaptive weights for each bit of the binary codes, so that images can be efficiently ranked at a finer resolution based on weighted Hamming distance. Our work is not about proposing new data structures or embedding techniques. Rather, our objective is to alleviate a weakness that all binary embedding methods share, particularly in the context of image search.

There have been a few works using weighted Hamming distance for image retrieval, including parameter-sensitive hashing [24], Hamming distance weighting [13], and AnnoSearch [32]. Each bit of the binary code is assigned a weight in [24, 32], while in [13] the aim is to weigh the overall Hamming distance of local features for image matching. These methods are fundamentally different from this work. They all use a single set of weights, either to measure the importance of each bit in Hamming space [24, 32] or to rescale the final Hamming distance for better matching of sets of features [13], while ours is designed to learn different weights efficiently and adaptively for each query.

3. SYSTEM ARCHITECTURE

Figure 2 gives an overview of our search system, where all the image features in both the target database and the semantic classes are embedded into binary codes. The flowchart on the right illustrates the process of online search. In this figure, we assume that the class-specific bit-level weights of the binary codes (left; under the names of the semantic classes) have already been computed offline. This is done by an algorithm that lies at the very heart of our approach, which will be discussed later in Section 5. During online search, the binary code of the query image is used to search against labeled images in the predefined semantic classes – we pool a large set of images that are close to the query in the Hamming space to predict its potential semantic content, and then compute query-adaptive weights using the precomputed weights of the potentially relevant classes. Finally, images in the database are rapidly ranked using the weighted (query-adaptive) Hamming distance.

In what follows, we first briefly introduce two popular binary embedding methods. We then elaborate our algorithm for computing the bit-level weights for each of the predefined semantic classes. After that, we introduce a method for predicting the query-adaptive weights.

4. BINARY EMBEDDING

We consider two state-of-the-art binary embedding techniques: semi-supervised hashing and deep learning.

4.1 Semi-Supervised Hashing

Recently, Semi-Supervised Hashing (SSH) was proposed for binary embedding. SSH leverages semantic similarities among labeled data while remaining robust to overfitting [30]. The objective function of SSH consists of two main components: supervised empirical fitness and an unsupervised information-theoretic regularization. More specifically, SSH tries to minimize the empirical error on a small amount of labeled data, while an unsupervised term provides effective regularization by maximizing desirable properties like the variance and independence of individual bits. In the setting of SSH, one is given a set of $n$ points, $P = \{p_i\}$, $i = 1 \ldots n$, $p_i \in \mathbb{R}^D$, in which a small fraction of the pairs are associated with two categories of label information, $M$ and $C$.


Figure 2: Overview of our image search framework using query-adaptive Hamming distance. Given a query image, we first compute its bag-of-visual-words feature and embed it into a compact binary code (Section 4). The binary code is then used to predict query-adaptive weights by harnessing a set of semantic classes with precomputed class-specific weights (Section 5). Finally, the query-adaptive weights are applied to generate a rank list of search results using weighted (query-adaptive) Hamming distance.

Specifically, a pair $(p_i, p_j) \in M$ is denoted as a neighbor-pair when $p_i$ and $p_j$ are close in a metric space or share common class labels. Similarly, $(p_i, p_j) \in C$ is called a nonneighbor-pair if the two samples are far away in the metric space or have no common class labels. The goal of SSH is to learn $d$ hash functions $H$ that maximize the following objective function:

$$J(H) = \sum_{k=1}^{d} \Bigg\{ \sum_{(p_i, p_j) \in M} h_k(p_i)\, h_k(p_j) \;-\; \sum_{(p_i, p_j) \in C} h_k(p_i)\, h_k(p_j) \Bigg\} \;+\; \sum_{k=1}^{d} \mathrm{var}[h_k(p)],$$

where the first term measures the empirical accuracy over the labeled sample pair sets $M$ and $C$, and the second part, i.e., the summation of the variances of the hash bits, realizes the maximum entropy principle [31]. The above optimization is nontrivial; however, after relaxation, the optimal solution can be approximated using eigen-decomposition. Furthermore, a sequential learning algorithm called S3PLH was designed to implicitly learn bit-dependent hash codes with the capability of progressively correcting errors made by previous hash bits [31]: in each iteration, the weighted pairwise label information is updated by imposing higher weights on point pairs violated by the previous hash function. In this work, the S3PLH algorithm is applied to generate compact hash codes, and we select 5,000 labeled samples for learning $H$. Both training and testing (binary embedding) of SSH are very efficient [30, 31].
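For concreteness, the objective above can be evaluated directly once the hash functions are fixed. The following is a minimal sketch, assuming linear hash functions of the form h_k(p) = sign(w_k^T p) with ±1 outputs; this form and the variance estimate are simplifying assumptions for illustration, not details taken from [30]:

```python
import numpy as np

def ssh_objective(W, P, M, C):
    """Evaluate J(H) for linear hash functions h_k(p) = sign(w_k^T p).

    W: (d, D) projection matrix, one row per hash function
    P: (n, D) data points
    M, C: lists of (i, j) index pairs (neighbor pairs / nonneighbor pairs)
    """
    H = np.sign(P @ W.T)                      # (n, d) hash bits in {-1, +1}
    fit = sum((H[i] * H[j]).sum() for i, j in M) \
        - sum((H[i] * H[j]).sum() for i, j in C)   # empirical fitness over M and C
    regularizer = H.var(axis=0).sum()         # sum of per-bit variances (maximum entropy term)
    return fit + regularizer
```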

4.2 Deep Learning

Learning with a deep belief network (DBN) was initially proposed for dimensionality reduction [8]. It was recently adopted for efficient search applications with binary embedding [23, 29, 9]. Similar to SSH, DBN also requires image labels during the training phase, such that images with the same label are more likely to be mapped to the same bucket. Since the DBN structure gradually reduces the number of units in each layer (in some cases the number of units may increase or remain the same for a few layers, and then decrease), the high-dimensional input of the raw descriptors can be projected into a compact Hamming space.

Broadly speaking, a general DBN is a directed acyclic graph, where each node represents a stochastic variable. There are two critical steps in using a DBN for binary embedding, i.e., learning the interactions between variables and inferring observations from inputs. The learning of a DBN with multiple layers is very hard since it usually requires estimating millions of parameters. Fortunately, it has been shown in [7, 8] that the training process can be much more efficient if the DBN is specifically structured based on RBMs: each single RBM has two layers, containing visible units and hidden units respectively, and multiple RBMs can be stacked to form a deep belief net. Starting from the input layer with D dimensions, the network can be specifically designed to reduce the number of units, and finally output compact d-dimensional binary codes.

To obtain the optimal weights of the entire network, the training process of a DBN has two critical stages: unsupervised pre-training and supervised fine-tuning. The greedy pre-training phase is executed progressively, layer by layer from input to output, aiming to place the network weights (and the biases) in suitable neighborhoods of the parameter space. After the parameters of one layer converge via Contrastive Divergence, the outputs of this layer are fixed and treated as inputs to drive the training of the next layer. During the fine-tuning stage, labeled data is used to refine the network parameters through back-propagation. Specifically, a cost function is defined to ensure that points (binary codes) within a certain neighborhood share the same label [6], and the network parameters are then refined to maximize this objective using conjugate gradient descent.

In our experiments, the dimension of the image feature is fixed at 500. Similar to the network architecture used in [29], we use DBNs with five layers of sizes 500-500-500-256-d, where d is the dimension of the final binary code, ranging from 32 to 48. The training process requires learning $500^2 + 500^2 + 500 \cdot 256 + 256 \cdot d$ weights in total – a minimum of 636,192 weights (d = 32) and a maximum of 640,288 weights (d = 48). For training samples, we use a total of 160,000 samples in the pre-training stage, and 50 batches of neighborhood regions of size 1,000 in the fine-tuning stage. Based on the efficient algorithms described earlier [7, 8], the entire training process can be finished within one day on a fast computer with a 2G Core-2 Quad CPU (15-24 hours depending on the output code size). Since parameter training is an offline process, this computational cost is quite acceptable. Compared to training, generating binary codes with the learned parameters is much faster – using the same computer, it takes just 0.7 milliseconds to compute a 48-bit code for each image.
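For reference, the weight count above follows directly from the listed layer sizes (biases are not counted in this tally):

$$500^2 + 500^2 + 500 \cdot 256 + 256\,d = 628{,}000 + 256\,d,$$

which gives 636,192 weights for d = 32 and 640,288 weights for d = 48.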

5. IMAGE SEARCH BY QUERY-ADAPTIVE HAMMING DISTANCE

With binary embedding, scalable image search can be executed in Hamming space using the Hamming distance. The computation of the Hamming distance is highly efficient via bitwise operations; for example, comparing a query against a database containing a few million samples can finish within a fraction of a second. By definition, the Hamming distance between two binary codes is the total number of bits with different values, where the specific indices of the bits are not considered, i.e., all bits are treated equally. For example, given three binary codes x = 1100, y = 1111, and z = 0000, the Hamming distance of x and y is equal to that of x and z, regardless of the fact that z differs from x in the first two bits while y differs in the last two bits. Due to this nature of the Hamming distance, in practice there can be hundreds or even thousands of samples having the same distance to a query. Returning to the example, suppose we know that the first two bits are more important (discriminative) for x; then y should be ranked higher than z when x is the query. In this section, we propose to learn query-adaptive weights for each bit of the binary code, so that images sharing the same Hamming distance can be ordered at a finer resolution.
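To make the toy example concrete, here is a minimal Python sketch; the bit weights are hypothetical values chosen only so that the first two bits matter more, and the weighted distance follows the squared form used later in Eq. (13):

```python
import numpy as np

x = np.array([1, 1, 0, 0])   # query
y = np.array([1, 1, 1, 1])   # differs from x in the last two bits
z = np.array([0, 0, 0, 0])   # differs from x in the first two bits

def hamming(a, b):
    return int((a != b).sum())

def weighted_hamming(a, b, w):
    # ||w o a - w o b||^2: for binary codes this sums w_i^2 over the differing bits
    return float(((w * (a - b)) ** 2).sum())

w = np.array([0.4, 0.4, 0.1, 0.1])   # hypothetical weights: first two bits more discriminative for x

print(hamming(x, y), hamming(x, z))                           # 2 2    -> a tie
print(weighted_hamming(x, y, w), weighted_hamming(x, z, w))   # 0.02 0.32 -> y ranks above z
```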

5.1 Learning Class-Specific Weights

To quickly compute the query-adaptive weights, we propose to first learn class-specific weights for a number of semantic concept classes (e.g., scenes and objects). Assume that we have a dataset of $k$ semantic classes, each with a set of labeled images. We learn bit-level weights separately for each class, with the objective of maximizing intra-class similarity as well as maintaining inter-class relationships. Formally, for two binary codes $x$ and $y$ in classes $i$ and $j$ respectively, their proximity is measured by the weighted Hamming distance $\|a_i \circ x - a_j \circ y\|^2$, where $\circ$ denotes the element-wise (Hadamard) product, and $a_i$ ($a_j$) is the bit-level weight vector for class $i$ ($j$).

Let $X$ be a set of $n$ binary codes in a $d$-dimensional Hamming space, $X = \{x_1, ..., x_n\}$, $x_j \in \mathbb{R}^d$, $j = 1, ..., n$. Denote by $X_i \subset X$ the subset of codes from class $i$, $i = 1, ..., k$. Our goal is to learn $k$ weight vectors $a_1, ..., a_k$, where $a_i \in \mathbb{R}^d$ corresponds to class $i$. The learned weights should satisfy some constraints. First, $a_i$ should be nonnegative (i.e., each entry of $a_i$ is nonnegative), denoted as $a_i \geq 0$. To fix the scale, we enforce the entries of $a_i$ to sum to 1, i.e., $a_i^\top \mathbf{1} = 1$, where $\mathbf{1}$ denotes the $d$-dimensional vector of ones.

Ideally, a good weight vector should ensure that the weighted codes from the same class are close to each other. This can be quantified as follows:

$$f(a_1, ..., a_k) = \sum_{i=1}^{k} \sum_{x \in X_i} \|a_i \circ x - a_i \circ c^{(i)}\|^2, \qquad (1)$$

where $c^{(i)}$ is the center of the binary codes in class $i$, i.e., $c^{(i)} = \frac{1}{n_i} \sum_{x \in X_i} x$, with $n_i$ being the total number of binary codes in $X_i$. Notice that although sample proximity was, to some extent, considered in the binary embedding process of SSH and DBN, Eq. (1) is still helpful since weighting the binary codes further condenses samples from the same class. More importantly, the optimal bit-level weights to be learned here will serve as the key for computing the query-adaptive Hamming distance.

In addition to intra-class similarity, we also want the inter-class relationships in the original image feature (BoW) space to be preserved in the weighted Hamming space. Let $s_{ij} \geq 0$ denote the proximity between classes $i$ and $j$ ($s_{ij} = s_{ji}$); $s_{ij}$ can be quantified using the average BoW feature similarity of samples from classes $i$ and $j$. It is then expected that the weighted codes of classes $i$ and $j$ should be relatively more similar when $s_{ij}$ is large. This maintains the class relationships, which is important since the semantic classes under our consideration are not exclusive – in fact some of them are highly correlated (e.g., tree and grass). We formalize this idea as the minimization of the following term:

$$g(a_1, ..., a_k) = \sum_{i,j=1}^{k} s_{ij} \|a_i \circ c^{(i)} - a_j \circ c^{(j)}\|^2. \qquad (2)$$

Based on the above analysis, we propose the following optimization problem to learn the weights for each class:

$$\min_{a_1, ..., a_k} \; f(a_1, ..., a_k) + \lambda\, g(a_1, ..., a_k) \qquad (3)$$
$$\text{s.t.} \quad a_i^\top \mathbf{1} = 1, \quad i = 1, ..., k, \qquad (4)$$
$$\qquad\;\; a_i \geq 0, \quad i = 1, ..., k, \qquad (5)$$

where $\lambda \geq 0$ is a parameter that controls the balance of the two terms. Next we show that the above optimization problem can be efficiently solved using an iterative quadratic programming (QP) scheme.

First, we can rewrite $f$ in matrix form as

$$f(a_1, ..., a_k) = \sum_{i=1}^{k} a_i^\top A_i a_i, \qquad (6)$$

where $A_i$ is a symmetric positive semidefinite matrix:

$$A_i = \mathrm{diag}\Big(\sum_{x \in X_i} (x - c^{(i)}) \circ (x - c^{(i)})\Big). \qquad (7)$$

Also, to simplify $g$, we have

$$\|a_i \circ c^{(i)} - a_j \circ c^{(j)}\|^2 = a_i^\top C_{ii} a_i - 2 a_i^\top C_{ij} a_j + a_j^\top C_{jj} a_j, \qquad (8)$$

where $C_{ij} = \mathrm{diag}(c^{(i)} \circ c^{(j)})$.

With these preparations, we can expand Eq. (3) as

$$\begin{aligned}
f(a_1, ..., a_k) + \lambda\, g(a_1, ..., a_k)
&= \sum_{i=1}^{k} a_i^\top A_i a_i + \lambda \sum_{j,l=1}^{k} s_{jl}\big(a_j^\top C_{jj} a_j - 2 a_j^\top C_{jl} a_l + a_l^\top C_{ll} a_l\big) \\
&= a_i^\top \Big(A_i + 2\lambda \big(\textstyle\sum_l s_{il} - s_{ii}\big) C_{ii}\Big) a_i - \Big(4\lambda \textstyle\sum_{j \neq i} s_{ji} C_{ji} a_j\Big)^{\top} a_i \\
&\quad + \Big(\textstyle\sum_{j \neq i} a_j^\top A_j a_j + 2\lambda \textstyle\sum_{j \neq i,\, l} s_{jl}\, a_j^\top C_{jj} a_j - 2\lambda \textstyle\sum_{j \neq i,\, l \neq i} s_{jl}\, a_j^\top C_{jl} a_l\Big) \\
&= \frac{1}{2} a_i^\top Q_i a_i + p_i^\top a_i + t_i, \qquad (9)
\end{aligned}$$


Algorithm 1: Learning class-specific weights.

1: INPUT: Binary codes $X_i$ and class similarities $s_{ij}$, $i, j = 1, ..., k$.
2: OUTPUT: Class-specific weights $a_j$, $j = 1, ..., k$.
3: Compute $c^{(i)}$, and initialize $a_j = 1/d$, $j = 1, ..., k$;
4: Repeat
5:   For $i = 1, ..., k$
6:     Compute $Q_i$, $p_i$, and $t_i$ using Eq. (10) – Eq. (12);
7:     Solve the following QP problem:
         $a_i^* = \arg\min_{a_i} \; \frac{1}{2} a_i^\top Q_i a_i + p_i^\top a_i + t_i$
         s.t. $a_i^\top \mathbf{1} = 1$ and $a_i \geq 0$;
8:     Set $a_i = a_i^*$;
9:   End for
10: Until convergence

where

$$Q_i = 2 A_i + 4\lambda \Big(\sum_l s_{il} - s_{ii}\Big) C_{ii}, \qquad (10)$$
$$p_i = -4\lambda \sum_{j \neq i} s_{ji} C_{ji} a_j, \qquad (11)$$
$$t_i = \sum_{j \neq i} a_j^\top A_j a_j + 2\lambda \sum_{j \neq i,\, l} s_{jl}\, a_j^\top C_{jj} a_j - 2\lambda \sum_{j \neq i,\, l \neq i} s_{jl}\, a_j^\top C_{jl} a_l. \qquad (12)$$

Up to this point we have derived the quadratic form of Eq. (3) w.r.t. $a_i$. Notice that $Q_i$ is symmetric and positive semidefinite, as $A_i$ and $C_{ii}$ are. Given all $a_j$, $j \neq i$, the quadratic program in Eq. (9) is convex, and thus $a_i$ can be solved optimally. This analysis suggests the iterative procedure summarized in Algorithm 1 for solving the optimization problem stated in Eq. (3) – Eq. (5). The stop condition (convergence) is defined as the energy difference between two states, $|E_{\mathrm{current}} - E_{\mathrm{previous}}| < \xi$, where $E = f(a_1, ..., a_k) + \lambda g(a_1, ..., a_k)$ and $\xi$ is set to a small value ($10^{-6}$ in our experiments).
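To make the procedure concrete, the following is a minimal NumPy sketch of Algorithm 1. It is an illustration rather than the authors' implementation: the per-class subproblem is handed to a generic SciPy solver (SLSQP) instead of a dedicated QP package, and the constant term $t_i$ is dropped since it does not affect the minimizer.

```python
import numpy as np
from scipy.optimize import minimize  # generic constrained solver used in place of a QP package

def learn_class_weights(X_classes, S, lam=1.0, tol=1e-6, max_iter=50):
    """X_classes: list of k arrays (n_i x d) of binary codes; S: (k x k) class-similarity matrix."""
    k, d = len(X_classes), X_classes[0].shape[1]
    c = [Xi.mean(axis=0) for Xi in X_classes]                                         # class centers c^(i)
    A = [np.diag(((Xi - c[i]) ** 2).sum(axis=0)) for i, Xi in enumerate(X_classes)]   # Eq. (7)
    C = lambda i, j: np.diag(c[i] * c[j])                                             # C_ij = diag(c^(i) o c^(j))
    a = [np.full(d, 1.0 / d) for _ in range(k)]                                       # initialize a_j = 1/d

    def energy():
        f = sum(((a[i] * X_classes[i] - a[i] * c[i]) ** 2).sum() for i in range(k))   # Eq. (1)
        g = sum(S[i, j] * ((a[i] * c[i] - a[j] * c[j]) ** 2).sum()
                for i in range(k) for j in range(k))                                  # Eq. (2)
        return f + lam * g

    E_prev = energy()
    for _ in range(max_iter):
        for i in range(k):
            Qi = 2 * A[i] + 4 * lam * (S[i].sum() - S[i, i]) * C(i, i)                # Eq. (10)
            pi = -4 * lam * sum(S[j, i] * C(j, i) @ a[j] for j in range(k) if j != i) # Eq. (11)
            obj = lambda x: 0.5 * x @ Qi @ x + pi @ x                                 # quadratic form of Eq. (9)
            res = minimize(obj, a[i], bounds=[(0, None)] * d,
                           constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1}])
            a[i] = res.x
        E = energy()
        if abs(E - E_prev) < tol:   # stop condition |E_current - E_previous| < xi
            break
        E_prev = E
    return a
```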

Algorithm Discussion. It is interesting to briefly discuss the differences and connections between our algorithm and some general machine learning methods such as LDA [5] and distance metric learning, although ours is designed specifically for this application. LDA is a well-known method that linearly projects data into a low-dimensional space where the sample proximity is reshaped to maximize class separability. While LDA also tries to condense samples from the same class, it learns a single uniform transformation matrix $G \in \mathbb{R}^{d \times s}$ to map all original $d$-dimensional features to a lower $s$-dimensional space. In contrast, we learn a $d$-dimensional weight vector separately for each class.

Distance metric learning tries to find a metric such that samples of different classes can be better separated in the learned metric space. Typically a single Mahalanobis distance metric is learned for the entire input space [35]. There are also a few exceptions that consider multiple metrics, e.g., a metric is trained per category in [33]. Besides the different formulations, our method aims to learn bit-level weights of binary codes for each class, which was not considered in these related works.

5.2 Computing Query-Adaptive Weights

As described in Figure 2, the labeled images and the learned weights of the predefined semantic concept classes form a semantic database for rapid computation of the query-adaptive weights. Given a query image $q$ and its binary code $x_q$, the objective is to adaptively determine a suitable weight vector $a_q$, so that we can apply weighted Hamming distance to compare $x_q$ to each binary code $x_t$ in the target database:

$$d(x_q, x_t) = \|a_q \circ x_q - a_q \circ x_t\|^2. \qquad (13)$$

To compute $a_q$, we query $x_q$ against the semantic database based on Hamming distance. The semantic labels of the top $k$ most similar images are collected to predict the labels of the query, which are then utilized to compute the query-adaptive weights. Specifically, denote by $T$ the set of the most relevant semantic classes to the query $q$, and by $m_i$ the number of images (within the top-$k$ list) from the $i$-th class in $T$. The query-adaptive weights are computed via the linear combination $a_q = \sum_{i \in T} (m_i a_i) / \sum_{i \in T} m_i$, where $a_i$ is the precomputed weight vector of the corresponding class. We empirically choose 500 for $k$ (with random selection among images of equal Hamming distance) and 3 for $|T|$.
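A minimal sketch of this step is given below. It follows the description above under the stated choices (k = 500, |T| = 3); the function and variable names are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def query_adaptive_weights(xq, sem_codes, sem_labels, class_weights, k=500, n_classes=3):
    """Predict the query-adaptive bit weights a_q for a query code xq.

    sem_codes:     (N, d) binary codes of the labeled semantic database
    sem_labels:    length-N integer array of class indices
    class_weights: (K, d) precomputed class-specific weight vectors a_i
    """
    hamming = (sem_codes != xq).sum(axis=1)          # Hamming distance to every labeled image
    top = np.argsort(hamming, kind="stable")[:k]     # top-k most similar labeled images
    counts = np.bincount(sem_labels[top], minlength=len(class_weights))
    T = np.argsort(-counts)[:n_classes]              # the |T| most represented classes
    m = counts[T].astype(float)
    # a_q = sum_i m_i a_i / sum_i m_i  (linear combination of class-specific weights)
    return (m[:, None] * class_weights[T]).sum(axis=0) / m.sum()
```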

It is important to note that there is no need to compute the weighted Hamming distance in the form shown in Eq. (13), which can be rewritten as $(a_q \circ a_q)^\top (x_q \oplus x_t)$, where $\oplus$ denotes the XOR of the two binary codes. While the weighting can now be achieved by an inner-product operation, it is actually possible to avoid even this computational cost – by computing the traditional Hamming distance and then ranking the images at the binary code level rather than at the Hamming distance level. Since we have computed the weight of each bit, the order of the binary codes (relevance to the query) can be obtained with very minor additional cost. Similar observations were also given in [24].

6. EXPERIMENTS

Dataset and Setup: We conduct image search experiments using the NUS-WIDE dataset [2]. NUS-WIDE contains 269,648 Flickr images, which are divided into a training set (161,789 images) and a test set (107,859 images). It is fully labeled with 81 semantic concepts/classes, covering a wide range of semantic topics from objects (e.g., bird and car) to scenes (e.g., mountain and harbor). Notice that NUS-WIDE is, to our knowledge, the largest available dataset that is fully labeled with a large number of classes. Other well-known and larger datasets such as ImageNet and MIT TinyImages are either not fully labeled³ or unlabeled, creating difficulties in performance evaluation.

³Each image in ImageNet has only one label. E.g., a car image may be labeled to a vehicle-related category, but not to road or building categories, although both may co-occur with car.

The concepts in NUS-WIDE are very suitable for constructing the semantic database in our approach; we therefore use the training set to learn the class-specific weights based on the algorithm introduced in Section 5. We compute a set of weights for each type of binary code, where the balance parameter λ is uniformly set to 1. In our experiments, we tried a range of λ values and found that the performance remains fairly stable.

Local features are extracted from all the images using Lowe's DoG detector and SIFT descriptor [18].


Figure 3: Average number of returned images at each Hamming distance, computed using 20k queries and 48-bit binary codes generated by DBN. Note that the y-axis is plotted in log-scale.

Figure 4: Comparison of our query-adaptive Hamming distance (red bars) and traditional Hamming distance (blue bars). Results are measured by ∆MAP over 8,000 randomly selected queries. We show ∆MAP over the entire and the upper half of the result rank lists using 32-bit and 48-bit binary codes from both DBN and SSH (panels: (a) DBN, entire list; (b) DBN, upper half; (c) SSH, entire list; (d) SSH, upper half). The performance gains are marked on top of the red bars.

We then compute soft-weighted BoW features [14] using a visual vocabulary of 500 visual words. For evaluation, we rank all images in the test dataset according to their (weighted) Hamming distances to each query. An image is treated as a correct hit if it shares at least one common class label with the query. Performance is evaluated using ∆AP, an extended version of average precision in which the prior probability of each query result is taken into account [36]. To aggregate the performance of multiple queries, mean ∆AP (∆MAP) is used.

Characteristics of Binary Codes for Search: We first analyze the number of images returned at each Hamming distance. The 48-bit binary codes from DBN are used in this experiment. Note that we do not specifically investigate the effect of code length in this paper, since several previous works on binary embedding have already shown that codes of 32-50 bits work well for search [29, 15]. In general, using more bits may lead to higher precision, but at the price of larger memory usage and longer search time.

Figure 3 visualizes the results, averaged over 20,000 randomly selected queries. As shown in the figure, the number of returned images rapidly increases as the Hamming radius (distance) grows. This reflects one inherent property of the Hamming distance – as mentioned in the introduction, there can be $C_d^i$ different binary codes sharing the same integer distance $i$ ($i > 0$) to a query in a $d$-dimensional Hamming space. Consequently, the number of binary codes (and correspondingly the number of images) at each specific distance increases dramatically as $i$ grows, until $i = d/2$. For some queries, there can be as many as $10^3$-$10^4$ images sharing equal Hamming distance. This clearly motivates the need for our proposed approach, which provides ranking at a finer granularity. On the other hand, although our approach does not permute images with Hamming distance 0 to the queries, the analysis shows that this is not a critical issue since most queries have none or just a few such images (2.4 on average over the evaluated queries).

Results and Analysis: We now evaluate how much performance gain can be achieved with the proposed query-adaptive Hamming distance. Figure 4 displays the results using 32-bit and 48-bit binary codes from both DBN and SSH. In this experiment, we randomly pick 8,000 queries (each containing at least one semantic label) and compute the averaged performance over all queries. As shown in Figure 4(a,c), our approach significantly outperforms traditional Hamming distance. For the DBN codes, it improves the 32-bit baseline by 6.2% and the 48-bit baseline by 10.1% over the entire rank lists. Slightly lower but very consistent improvements (about 5%) are obtained for the SSH codes. These results clearly validate the effectiveness of learning query-adaptive weights for image search with compact codes.

Figure 4(b,d) further gives the performance over the upper half of the ranked search results, using the same set of queries. The aim of this evaluation is to verify whether our approach is able to improve the ranking of the top images, i.e., those with relatively smaller Hamming distances to the queries. As expected, we see similar performance gains to those over the entire lists.

As for the baseline performance, the DBN codes are better because more labeled training samples are used (50k for DBN vs. 5k for SSH). Note that comparing DBN to SSH is beyond the focus of this work, as the latter is a semi-supervised method that prefers, and is more suitable for, cases with limited training samples. A direct comparison of the two with equal training set sizes can be found in [30].

To see whether the improvement is consistent over the evaluated queries, we group the queries into 81 categories based on their associated labels. Results for the 48-bit DBN codes are displayed in Figure 5. A steady performance gain is obtained for almost all query categories, and none of them suffers from performance degradation, which again validates the effectiveness of our approach. Figure 6 shows the top-5 images for two example queries, where our approach produces better results by replacing the clearly irrelevant images with more suitable ones.

7. CONCLUSION

Searching images with binary codes and Hamming distance has great potential in practice, as its efficiency allows us to rapidly browse thousands of images on a regular laptop computer (e.g., for personal photo organization), and millions or even billions of images on a decent server (e.g., for Web-scale applications). This paper proposed a novel framework for scalable image search using query-adaptive Hamming distance, obtained by efficiently harnessing a large set of predefined semantic concept classes. Compared with the traditional Hamming distance of binary codes, our approach is able to adaptively produce finer-grained ranking of search results. Experiments on a Flickr image dataset have confirmed the effectiveness of our approach.


Figure 5: Per-category ∆MAP comparison of traditional Hamming distance and query-adaptive Hamming distance, using 48-bit DBN codes. The queries are grouped into 81 categories based on their associated labels.

Figure 6: Top-5 returned images for two example queries, using the 48-bit DBN codes. Our approach shows better results than the baseline (traditional Hamming distance) by removing irrelevant images (red dotted boxes) from the top-5 lists. The others are all visually relevant (night-view/outdoor scene for the first query, and water view for the second one).

One important piece of future work is to further enlarge the number of semantic classes for more accurate prediction of the query-adaptive bit weights.

Acknowledgement

This work is supported in part by NSF Award No. CNS-07-51078. We thank Dr. Zhenguo Li for his help on this work.

8. REFERENCES

[1] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975.
[2] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In CIVR, 2009.
[3] O. Chum, M. Perdoch, and J. Matas. Geometric min-hashing: Finding a (thick) needle in a haystack. In CVPR, 2009.
[4] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2), 2008.
[5] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.
[6] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.
[7] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
[8] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[9] E. Horster and R. Lienhart. Deep networks for image retrieval on large-scale databases. In ACM Multimedia, 2008.
[10] P. Indyk. Nearest neighbors in high-dimensional spaces. In J. E. Goodman and J. O'Rourke, editors, Handbook of Discrete and Computational Geometry, chapter 39. CRC Press, 2004.
[11] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Symposium on Theory of Computing, 1998.
[12] H. Jegou, M. Douze, and C. Schmid. Packing bag-of-features. In CVPR, 2009.
[13] H. Jegou, M. Douze, and C. Schmid. Improving bag-of-features for large scale image search. IJCV, 87:191-212, 2010.
[14] Y. G. Jiang, C. W. Ngo, and J. Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, 2007.
[15] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.
[16] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, 2009.
[17] N. Kumar, L. Zhang, and S. Nayar. What is a good nearest neighbors algorithm for finding similar patches in images? In ECCV, 2008.
[18] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[19] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP, 2009.
[20] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.
[21] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42:145-175, 2001.
[22] T. Quack, U. Monich, L. Thiele, and B. Manjunath. Cortina: A system for large-scale, content-based web image retrieval. In ACM Multimedia, 2004.
[23] R. Salakhutdinov and G. Hinton. Semantic hashing. In SIGIR Workshop, 2007.
[24] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In ICCV, 2003.
[25] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
[26] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. PAMI, 22(12):1349-1380, 2000.
[27] J. R. Smith and S.-F. Chang. VisualSEEk: a fully automated content-based image query system. In ACM Multimedia, 1996.
[28] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. PAMI, 30(11):1958-1970, 2008.
[29] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
[30] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, 2010.
[31] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In ICML, 2010.
[32] X.-J. Wang, L. Zhang, F. Jing, and W.-Y. Ma. AnnoSearch: image auto-annotation by search. In CVPR, 2006.
[33] K. Weinberger and L. Saul. Fast solvers and efficient implementations for distance metric learning. In ICML, 2008.
[34] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
[35] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NIPS, pages 521-528, 2003.
[36] J. Yang and A. G. Hauptmann. (Un)reliability of video concept detection. In CIVR, 2008.
[37] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2), 2006.