
Query-Dependent Visual Dictionary Adaptation for Image Reranking

Jialong Wang, Xidian University, Xi'an, 710071 China, [email protected]

Cheng Deng, Xidian University, Xi'an, 710071 China, [email protected]

Wei Liu, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, [email protected]

Rongrong Ji, Xiamen University, Xiamen, 361005 China, [email protected]

Xiangyu Chen, I2R, A*STAR, Singapore, 138632 Singapore, [email protected]

Xinbo Gao, Xidian University, Xi'an, 710071 China, [email protected]

ABSTRACT
Although text-based image search engines are popular for ranking images of a user's interest, the state-of-the-art ranking performance is still far from satisfactory. One major issue comes from the visual similarity metric used in the ranking operation, which depends solely on visual features. To tackle this issue, one feasible method is to incorporate semantic concepts, also known as image attributes, into image ranking. However, the optimal combination of visual features and image attributes remains unknown. In this paper, we propose a query-dependent image reranking approach that leverages higher-level attribute detection among the top returned images to adapt the dictionary built over the visual features in a query-specific fashion. We start from offline learning of transposition probabilities between visual codewords and attributes, then utilize these probabilities to adapt the dictionary online, and finally produce a query-dependent, semantics-induced metric for image ranking. Extensive evaluations on several benchmark image datasets demonstrate the effectiveness and efficiency of the proposed approach in comparison with the state of the art.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Algorithms, Experimentation, Performance

Keywords
Image Reranking, Query Dependent, Dictionary Adaptation


Figure 1: Framework of our proposed approach.

1. INTRODUCTION
Most popular web image search engines, such as Google and Microsoft Bing, are built under the "query by keyword" scenario, in which related images are returned by using the associated textual information from web pages, including the title, description, and surrounding captions of the related images [1]. However, text-based search schemes are known to be unsatisfying because textual information does not always describe image content accurately. Besides, mismatches between images and their associated text inevitably introduce irrelevant images into the search results.

To boost the precision of text-based image search, image reranking [2], which refers to refining an initially ranked image list queried by an input keyword, has received increasing attention in recent years. By asking a user to select a query image that reflects the user's search intention, the initial ranking list is reordered using a similarity metric based on visual features to achieve better search performance. One basic assumption of image reranking is that visually similar images tend to be ranked together.

Although there have been many image reranking approaches from different perspectives, the major challenge of capturing the user's search intention well remains unsolved. To cope with this challenge, recent efforts have focused on "query expansion", which correlates low-level visual features with high-level semantic meanings of images by absorbing more image cues. Existing query expansion in image reranking can be classified into two main categories:


visual expansion and semantic expansion [3]. The goal of visual expansion is to obtain multiple positive instances to learn a robust similarity metric that is expected to be specific to a particular query image. As a conventional means, relevance feedback is used to expand the positive instances [4], where a user is asked to label multiple relevant and irrelevant image instances. However, relevance feedback places an extra labeling burden on the user. To reduce this burden, pseudo relevance feedback [5][6] is adopted to expand query images by taking top-ranked results as positive instances and bottom-ranked results as negative instances. These instances are further leveraged to train a classifier that outputs the ranking scores. Unfortunately, pseudo-relevance-feedback-based methods are not guaranteed to work well due to the existence of falsely relevant images.

The idea of semantic expansion is to expand the original query image with additional query terms that are relevant to the query keyword [7]. Semantic expansion was initially proposed for document retrieval. Lexical-based methods leveraged linguistic word relationships, e.g., synonyms and hypernyms, to expand query keywords. Statistical approaches, such as term clustering [8] and Latent Semantic Analysis (LSA) [9], attempted to discover term relationships based on term-document co-occurrence statistics. Other methods attempted to reduce topic drift by seeking frequently co-occurring patterns within the same context instead of the entire document. However, the additional terms expanded by these methods are not always consistent with the semantic concepts of the query images, which makes the ranking results unsatisfying.

In this paper, we propose a novel image reranking approach with query-dependent visual dictionary adaptation. Given an image retrieval list, we first construct a query-specific transposition probability dictionary between visual features and semantic concepts. Different from most existing methods, we represent semantic concepts as image attributes, which can be obtained using off-the-shelf approaches such as Classeme¹ and ObjectBank². For a query image, its salient visual words and attributes can be acquired by encoding with the learned dictionary. On one hand, the semantic concepts of the query image are well expanded; on the other hand, the obtained semantic concepts are not only query-specific but also related to common concepts shared across the whole dataset. Hence, these obtained image cues can well reflect the user's search intention. To further preserve the visual and semantic similarities between the query image and the reranked top-K images, we then learn a similarity metric under which the query image and its related images are kept as close as possible in a new feature space, while irrelevant images are successfully filtered out.

The contributions of this work are summarized as follows:

(1) We construct a corpus-oriented dictionary that explicitly captures the latent consistency between visual features and semantic concepts. In our work, we use image attributes to represent semantic concepts. To the best of our knowledge, this is the first time such a co-occurring dictionary has been built in the context of image reranking.

(2) We simultaneously accomplish visual expansion and semantic expansion for any query image with the learned co-occurring dictionary, which results in more flexible and accurate image reranking.

(3) We learn a similarity metric that yields a new feature space, where visual similarities and semantic correlations can be well preserved for any query image and its related top-K images. By doing so, the reranking performance can be further boosted.

¹http://www.cs.dartmouth.edu/~lorenzo/projects/classemes/
²http://vision.stanford.edu/projects/objectbank/

Figure 2: Similarity metric learning supervised by semantic expansion.

The rest of the paper is organized as follows: Section 2 presents our approach; Section 3 describes the experiments, including the experimental settings, results, and discussion; Section 4 concludes the paper.

2. THE APPROACH
Figure 1 illustrates the proposed image reranking framework, which consists of offline transposition probability learning and online similarity metric learning. We detail the two components below.

2.1 Formulation
Given a query image $q$, its retrieval set is denoted as $I_q = \{I_1, \cdots, I_N\}$. For each image $I_i$ in $I_q$, we extract its visual vector $V^{(i)} = [v_1^{(i)}, v_2^{(i)}, \cdots, v_m^{(i)}]$ and its attribute vector $A^{(i)} = [a_1^{(i)}, a_2^{(i)}, \cdots, a_n^{(i)}]$.

In our work, Bag-of-Words (BoW) is used to describe visual features, and Classeme is used to extract an attribute vector describing semantic concepts. We form the visual dictionary V and the semantic dictionary A on the retrieval set $I_q$, respectively. The transposition probability matrix W can then be built from the co-occurrence of entries between V and A. For the query image $q$, visual and semantic query expansion correspond to the significant visual and semantic elements selected according to W. This procedure can be formulated as

$$\hat{V}^{(q)} = f(V^{(q)}, W), \quad \hat{A}^{(q)} = g(A^{(q)}, W). \tag{1}$$

Here, $f(\cdot)$ and $g(\cdot)$ are the respective selection functions, which will be detailed later.

For image reranking, we hope the reordered top-K images are more related to the query image in terms of both visual and semantic similarity. To that end, we propose to learn a similarity metric matrix M online, which is used to adapt the original visual features into a new subspace where semantics are preserved. Figure 2 shows an example of such a space, where an image close to the query image should be both visually and semantically relevant. In other words, we aim to learn a semantics-induced manifold structure in the visual feature space that captures the essence of both metrics. This corresponds to minimizing both visual and semantic dissimilarities:

In terms of visual similarity, the accumulated distances between the top-K images and the query image should be minimized:

$$\min_{M} \sum_{k=1}^{K} \left\| V^{(k)} M - V^{(q)} M \right\|_2^2. \tag{2}$$

In terms of semantic relevance, the accumulated distances of attribute vectors between the top-K images and the query image should be minimized:

$$\min_{M} \sum_{k=1}^{K} \left\| V^{(k)} W M - V^{(q)} W M \right\|_2^2. \tag{3}$$

Combining Equations (2) and (3), we derive the overall objective function as

$$O = \min_{M} \sum_{k=1}^{K} \left\| V^{(k)} M - V^{(q)} M \right\|_2^2 + \sum_{k=1}^{K} \left\| V^{(k)} W M - V^{(q)} W M \right\|_2^2 + \lambda \| M \|_1, \tag{4}$$

where $\|\cdot\|_1$ is the $\ell_1$-norm of a matrix and $\lambda$ is a constant that controls the degree of sparsity. The main role of the sparse matrix M lies in selecting only the significant elements. Thus, the expanded visual features and semantic concepts are not only query-dependent but also corpus-oriented.
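For the optimization in Section 2.2, FISTA needs the gradient of the smooth part of Equation (4). Writing the smooth part as $H(M)$ and $\Delta_k = V^{(k)} - V^{(q)}$ (row vectors), a short derivation, left implicit in the paper, gives:

```latex
% Smooth part of Eq. (4) and its gradient, with \Delta_k = V^{(k)} - V^{(q)}:
\begin{align*}
H(M) &= \sum_{k=1}^{K} \|\Delta_k M\|_2^2 + \sum_{k=1}^{K} \|\Delta_k W M\|_2^2
      = \operatorname{tr}\!\big( M^\top (C + W^\top C W)\, M \big),
      \quad C = \sum_{k=1}^{K} \Delta_k^\top \Delta_k, \\
\nabla H(M) &= 2\,\big(C + W^\top C W\big)\, M .
\end{align*}
```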

Once the sparse metric matrix M is obtained, we calculate the similarity in the new space and use the similarity score as the reranking score to reorder the search results. The similarity between query $q$ and image $I_j$ is defined as

$$\mathrm{sim}(q, I_j) = \frac{\sum_{i=1}^{d} \min(x_{qi}, x_{ji})}{\sum_{i=1}^{d} \max(x_{qi}, x_{ji})}. \tag{5}$$
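Equation (5) is a generalized Jaccard (histogram-intersection-style) similarity. A minimal sketch of the reranking step follows, assuming nonnegative BoW-style features and that $x = VM$ denotes an image's representation in the new space; both are our reading of the paper rather than stated details.

```python
import numpy as np

def rerank(Vq, V_list, M):
    """Rerank images by the generalized Jaccard similarity of Eq. (5),
    computed on features projected by the learned sparse metric M.
    Assumes nonnegative projected features, as with BoW histograms."""
    xq = Vq @ M                              # query in the adapted space
    scores = []
    for Vj in V_list:
        xj = Vj @ M
        num = np.minimum(xq, xj).sum()
        den = np.maximum(xq, xj).sum()
        scores.append(num / max(den, 1e-12))
    order = np.argsort(scores)[::-1]         # indices in descending similarity
    return order, np.asarray(scores)
```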

2.2 Optimization
The smoothness of the first two terms in Equation (4) guarantees a convex and smooth objective function. Therefore, a general Smoothing Proximal Gradient (SPG) approach is adopted in the optimization step. More specifically, we solve this $\ell_1$-norm penalized sparse learning problem with the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [10]. SPG has been proven to achieve an $O(1/\epsilon)$ convergence rate for a desired accuracy $\epsilon$. The FISTA method is presented in Algorithm 1.

3. EXPERIMENTS
In this section, we first present the experimental settings, and then we illustrate and discuss the experimental results. Moreover, we compare our approach with state-of-the-art reranking methods, such as noise-resistant graph-based image reranking (NRGIR) [11] and random-walk-based image reranking (RWIR) [2], to demonstrate its effectiveness.

Algorithm 1: FISTA Algorithm.
1  Input: $V^{(k)}$, $W$, $M_0$, $\lambda$.
2  Initialization: set $\theta_0 = 1$, $X_0 = M_0 = I$.
3  for $t = 0, 1, 2, \ldots$ until convergence of $M_t$ do
4      Compute $\nabla O(X_t)$.
5      Compute the Lipschitz constant $L = \lambda_{\max}(\nabla O(X_t))$, where $\lambda_{\max}$ denotes the largest eigenvalue.
6      Perform the generalized gradient update step:
       $M_{t+1} = \arg\min_{M} \frac{1}{2} \left\| M - \left( X_t - \frac{1}{L} \nabla H(X_t) \right) \right\|_2^2 + \frac{\gamma}{L} \| M \|_1$.
       Set $\theta_{t+1} = \frac{2}{t+3}$.
7      Set $X_{t+1} = M_{t+1} + \frac{1-\theta_t}{\theta_t} \theta_{t+1} (M_{t+1} - M_t)$.
8  end
9  Output: $M = M_{t+1}$.
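Below is a minimal NumPy sketch of Algorithm 1 applied to Equation (4), using the gradient derived after that equation. For the shapes in Equations (2)–(3) to agree, we assume a square W (equal visual and attribute dictionary sizes); the fixed iteration count is also an illustrative assumption rather than the authors' stopping rule.

```python
import numpy as np

def soft_threshold(X, tau):
    """Elementwise soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def fista(V, vq, W, lam=0.05, iters=200):
    """Minimal FISTA sketch for Eq. (4): min_M H(M) + lam * ||M||_1.

    V  : (K, m) visual features of the top-K images (one per row).
    vq : (m,)   visual feature of the query image.
    W  : (m, m) transposition matrix; a square W is assumed here so that
         the matrix shapes in Eqs. (2)-(3) agree.
    """
    Delta = V - vq                            # (K, m) differences to the query
    C = Delta.T @ Delta                       # sum_k Delta_k^T Delta_k
    S = C + W.T @ C @ W                       # grad H(M) = 2 * S @ M
    L = 2.0 * np.linalg.eigvalsh(S).max()     # Lipschitz constant of grad H
    m = V.shape[1]
    M = np.eye(m)                             # M0 = I, as in Algorithm 1
    X, theta = M.copy(), 1.0
    for t in range(iters):
        grad = 2.0 * S @ X
        M_next = soft_threshold(X - grad / L, lam / L)   # shrinkage/prox step
        theta_next = 2.0 / (t + 3)
        X = M_next + ((1 - theta) / theta) * theta_next * (M_next - M)
        M, theta = M_next, theta_next
    return M
```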

Figure 3: mAP under different numbers of candidates K for three datasets.

3.1 Experimental Settings
Datasets: Experiments are conducted on four popular datasets: Oxford Building³, Paris⁴, INRIA Holidays⁵, and UKBench⁶. Oxford and Paris contain 5,062 and 6,412 images, respectively, all provided with manually annotated ground truth. INRIA includes 1,491 relevant images of 500 scenes or objects, where the first image of each group is used as a query. UKBench contains 10,200 images, where each group of images always shows the same object.

Feature: We use dense SIFT descriptors [12] computed from 16×16 image patches with a step size of 8 pixels using the VLFeat library. A 1,024-word visual vocabulary is then constructed from 1M descriptors. We use 1×1, 2×2, and 3×1 sub-regions to compute the BoW as the final visual feature of an image. In addition, we extract an attribute vector for each image with Classeme.
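To make the feature pipeline concrete, the sketch below pools hard-assigned visual words over the 1×1, 2×2, and 3×1 grids described above. Descriptor extraction (dense SIFT via VLFeat) is assumed done upstream; the per-cell L1 normalization and the orientation of the 3×1 split are our assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def spatial_bow(desc, xy, centers, img_w, img_h):
    """Sketch of the BoW feature with 1x1, 2x2, and 3x1 spatial pooling.

    desc    : (P, 128) dense SIFT descriptors (assumed precomputed, e.g. VLFeat).
    xy      : (P, 2)   patch centers (x, y) in pixels.
    centers : (1024, 128) visual vocabulary learned from ~1M descriptors.
    """
    words = cdist(desc, centers).argmin(axis=1)   # hard assignment to nearest word
    feats = []
    for gx, gy in [(1, 1), (2, 2), (3, 1)]:       # grids as (cols, rows); 3x1 orientation assumed
        cx = np.minimum((xy[:, 0] * gx / img_w).astype(int), gx - 1)
        cy = np.minimum((xy[:, 1] * gy / img_h).astype(int), gy - 1)
        cell = cy * gx + cx
        for c in range(gx * gy):
            h = np.bincount(words[cell == c], minlength=len(centers)).astype(float)
            feats.append(h / max(h.sum(), 1.0))   # L1-normalize each cell (assumption)
    return np.concatenate(feats)                  # (1 + 4 + 3) * 1024 dimensions
```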

Baselines: We design two baselines to show the improvement in retrieval accuracy: (I) visual-feature-based reranking, which independently uses BoW as the visual feature to evaluate the similarity between images; (II) semantic-feature-based reranking, which independently uses attributes as the semantic feature to evaluate image similarity.

Evaluation Metric: We use mean Average Precision (mAP) to evaluate the performance on the first three datasets, while the performance measure on UKBench is the average number of correct images among the top-4 returned results, denoted as Ave. Top Num.
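For reference, minimal implementations of the two measures are sketched below; dataset-specific conventions (e.g., Oxford's "junk" images) are ignored here for brevity, which is our simplification.

```python
import numpy as np

def average_precision(ranked_ids, relevant):
    """AP of one ranked result list against a set of relevant image ids."""
    hits, total = 0, 0.0
    for rank, idx in enumerate(ranked_ids, start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank              # precision at each relevant hit
    return total / max(len(relevant), 1)

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_set) pairs, one per query."""
    return float(np.mean([average_precision(r, g) for r, g in runs]))

def ave_top_num(runs, k=4):
    """UKBench measure: average number of relevant images in the top-k."""
    return float(np.mean([len(set(r[:k]) & g) for r, g in runs]))
```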

³http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/
⁴http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/
⁵http://lear.inrialpes.fr/~jegou/data.php
⁶http://vis.uky.edu/~stewe/ukbench/


Figure 4: Comparison results (a) and visual results (b) on the different datasets.

3.2 Results and Analysis
Parameter Tuning: In our method, the top-K dataset candidates for the query image $I_q$ are considered to evaluate reranking performance. In learning the objective function, we set λ = 0.05.

We first evaluate the performance of our approach on baseline I given different numbers of top dataset candidates K. Figure 3 shows the performance on the first three datasets as K varies. As K becomes larger, the mAP values on Oxford decrease: as shown in Figure 3, the mAP drops from 0.87 to 0.65 as K increases from 20 to 300.

A similar trend is observed on all the other datasets, and we use the same range (K from 20 to 300) for all of them. As an exception, since each query in UKBench has only three relevant images, K is set to 4 there. In the subsequent experiments, unless otherwise specified, we fix K = 200 for all datasets but UKBench.

Comparison Results: Figure 4(a) illustrates the mAP of the baselines and some state-of-the-art methods on the four datasets. As shown in Figure 4(a), all reranking methods are superior to baseline I, i.e., direct reranking based on the BoW visual feature. Our approach improves on baseline I by about 0.14 in mAP on the first three datasets and by about 1.01 in Ave. Top Num. on UKBench. Similarly, compared with NRGIR and RWIR, our approach obtains improvements of 0.23 and 0.1 on the first three datasets, and of 1.43 and nearly 1.0 on UKBench, respectively. Figure 4(b) shows some visual results of our image reranking, in which the three rows for each query are the results of NRGIR, RWIR, and our approach.

4. CONCLUSIONS
In this paper, we propose an image reranking approach using query-dependent visual dictionary adaptation. Through offline visual-semantic co-occurring dictionary learning, we not only effectively capture the query-specific relationship between low-level visual features and high-level semantic concepts but also sensibly extend the query image's semantic space, thus capturing the user's search intention. Furthermore, we conduct similarity metric learning supervised by the attribute-based semantic concepts, which brings the query image and its relevant images closer in a new feature space. The experimental results show that our approach performs significantly better than the state of the art.

Although our approach achieves good performance, it is more suitable for constrained datasets than for unconstrained web-scale image collections. The main drawback of our approach is that a limited number of semantic concepts cannot cover the vast number of images on the Internet.

Therefore, our future work will focus on obtaining more diverse and accurate semantic concepts by mining more textual information from massive Internet image collections. We also plan to explore scalable similarity metric learning using robust large graphs [13] and to accelerate the reranking operation using principled hashing methods such as [14][15].

5. ACKNOWLEDGMENTS
We thank the anonymous reviewers for their helpful comments and suggestions. This research was supported in part by the National Natural Science Foundation of China (Nos. 61125204 and 61101250), the Program for New Century Excellent Talents in University (NCET-12-0917), and the Program for New Scientific and Technological Star of Shaanxi Province (No. 2012KJXX-24).

6. REFERENCES
[1] X. Tian, L. Yang, J. Wang, Y. Yang, X. Wu, and X.-S. Hua. Bayesian video search reranking. In Proc. ACM Multimedia, 2008.
[2] W. Hsu, L. Kennedy, and S.-F. Chang. Reranking methods for visual search. IEEE Multimedia, 14(3):14–22, 2007.
[3] X. Tang, K. Liu, J. Cui, F. Wen, and X. Wang. IntentSearch: Capturing user intention for one-click internet image search. IEEE Trans. PAMI, 34(7):1342–1353, 2012.
[4] Y. Lu, H. Zhang, L. Wenyin, and C. Hu. Joint semantics and feature based image retrieval using relevance feedback. IEEE Trans. Multimedia, 5(3):339–347, 2003.
[5] R. Yan, A. G. Hauptmann, and R. Jin. Negative pseudo-relevance feedback in content-based video retrieval. In Proc. ACM Multimedia, 2003.
[6] N. Morioka and J. Wang. Robust visual reranking via sparsity and ranking constraints. In Proc. ACM Multimedia, 2011.
[7] A. Natsev, A. Haubold, J. Tesic, L. Xie, and R. Yan. Semantic concept-based query expansion and re-ranking for multimedia retrieval. In Proc. ACM Multimedia, 2007.
[8] K. Sparck Jones. Automatic Keyword Classification for Information Retrieval. Archon Books, 1971.
[9] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[10] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[11] W. Liu, Y.-G. Jiang, J. Luo, and S.-F. Chang. Noise resistant graph ranking for improved web image search. In Proc. CVPR, 2011.
[12] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR, 2006.
[13] W. Liu, J. He, and S.-F. Chang. Large graph construction for scalable semi-supervised learning. In Proc. ICML, 2010.
[14] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In Proc. ICML, 2011.
[15] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proc. CVPR, 2012.
