130 IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 4, NO. 1, MARCH 2014

Embedding Multi-Order Spatial Clues for Scalable Visual Matching and Retrieval

Shiliang Zhang, Qi Tian, Senior Member, IEEE, Qingming Huang, Senior Member, IEEE, and Yong Rui, Fellow, IEEE

Abstract—Matching duplicate visual content among images serves as the basis of many vision tasks. Researchers have proposed different local descriptors for image matching, e.g., floating point descriptors like SIFT and SURF, and binary descriptors like ORB and BRIEF. These descriptors either suffer from relatively expensive computation or from limited robustness due to the compact binary representation. This paper studies how to improve the matching efficiency and accuracy of floating point descriptors and the matching accuracy of binary descriptors. To achieve this goal, we embed the spatial clues among local descriptors into a novel local feature, i.e., the multi-order visual phrase, which contains two complementary clues: 1) the center visual clues extracted at each image keypoint and 2) the neighbor visual and spatial clues of multiple nearby keypoints. Different from existing visual phrase features, two multi-order visual phrases are flexibly matched by first matching their center visual clues, then estimating a match confidence by checking the spatial and visual consistency of their neighbor keypoints. Therefore, the multi-order visual phrase does not sacrifice the repeatability of the classic visual word and is more robust to quantization error than existing visual phrase features. We extract multi-order visual phrases from both SIFT and ORB and test them in image matching and retrieval tasks on UKbench, Oxford5K, and 1 million distractor images collected from Flickr. Comparisons with recent retrieval approaches clearly demonstrate the competitive accuracy and significantly better efficiency of our approaches.

Index Terms—Image local descriptor, image matching, large-scale image retrieval, visual vocabulary.

I. INTRODUCTION

MATCHING duplicate visual content among images is an important research issue in the computer vision field, and it serves as a basis in various applications.

Manuscript received September 09, 2013; revised November 20, 2013; accepted December 29, 2013. Date of publication January 23, 2014; date of current version March 07, 2014. This work was supported in part by the Army Research Office (ARO) under Grant W911NF-12-1-0057, in part by a Faculty Research Award by NEC Laboratories of America, in part by a 2012 UTSA START-R Research Award, in part by the National Science Foundation of China (NSFC) under Grant 61128007, in part by the National Basic Research Program of China (973 Program) under Grant 2012CB316400, and in part by the Natural Science Foundation of China (NSFC) under Grants 61025011 and 61332016. This paper was recommended by Guest Editor B. Yan.

S. Zhang and Q. Tian are with the Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249 USA (e-mail: [email protected]; [email protected]).

Q. Huang is with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]).

Y. Rui is with Microsoft Research Asia, Beijing 100080, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JETCAS.2014.2298272

Such applications include image alignment [18], [3], [15], [29], 3-D object reconstruction [2], duplicate detection [13], [33], large-scale partial-duplicate visual search [10], [8], [42], [40], [43], [41], [6], and many other image search works [34]–[36], [16], [44]. Discriminative image local descriptors are extracted from local image patches surrounding stable keypoints [18] or regions [20]; hence they are robust to image occlusions and can be repeatedly detected from duplicate contents in images. These desirable properties make local descriptors well suited to matching duplicate visual contents or the same objects among images.

Currently, there are two kinds of image local descriptors, i.e., floating point descriptors and binary descriptors. As one of the most popular floating point descriptors [18], [20], [12], the 128-dimensional SIFT [18] is computed by concatenating the 8-D histograms of gradient computed on 4×4 image subregions. A VGA-sized image commonly contains more than 1000 SIFT descriptors, and extracting and matching SIFT on VGA-sized images may cost more than 1 s on modern CPUs. Binary descriptors like ORB [26] and BRIEF [4] are commonly generated by comparing the intensities of pixels sampled at different locations. Hence they are far more efficient to compute than SIFT, but are less discriminative and less robust to image transformations. In Section II, we give a more detailed summary of existing local descriptors. In this paper, we study how to improve the efficiency of floating point descriptors and the accuracy of binary descriptors by embedding the spatial clues among local descriptors.

To improve the matching efficiency of SIFT and utilize it in large-scale image search, a commonly used strategy is to quantize SIFT descriptors [22] to their nearest visual words in a pre-trained visual vocabulary [22]. Images can hence be represented as bag-of-visual-words (BoWs) models, and identical visual words among images are considered as matched SIFT descriptors. It is easy to infer that the compact BoWs model allows for fast SIFT matching [30]. Meanwhile, with information retrieval approaches like inverted file indexes, the BoWs model is well suited to the large-scale image search task [22]. However, in many works [8], [9], [23], [40], [45] the traditional BoWs model shows limited discriminative power. As discussed in these works [8], [9], [23], [40], [45], the poor performance of the BoWs model is mainly due to two reasons: 1) representing a high-dimensional SIFT descriptor with a single visual word results in large quantization error, and 2) a single visual word cannot model and preserve the spatial clues among image keypoints, which have been proven important in image matching [30], [40], [45].



Fig. 1. An example of matched multi-order visual phrases between two images. Red: one-order match. Green: two-order match. Blue: three-order match. White: four-order match. A higher order means more similar spatial and visual clues of neighbor keypoints, and hence corresponds to a larger confidence of correctness.

To improve the matching performance of the BoWs model, there are mainly two lines of research: 1) decrease the quantization error by softly quantizing each SIFT descriptor into multiple visual words [24], [39] or embedding the quantization loss [8] in the image index file, and 2) capture more spatial clues in image retrieval by verifying the matched visual words [30], [47], [7], [6] or embedding spatial configurations directly into visual word combinations and visual phrases [40], [45], [38], which produce stronger discriminative power and fewer false matches. These approaches have shown substantial improvements over the baseline BoWs model. However, they either cannot jointly take these two issues into consideration, i.e., capture more spatial clues without magnifying the quantization error, or they significantly degrade efficiency or increase memory cost by introducing extra computation or storage. We briefly summarize these approaches and emphasize our differences with them in Section II. A more comprehensive review can be found in [31].

To solve the issues mentioned above, we extract the multi-order visual phrase from each image keypoint by 1) quantizing the SIFT descriptor at the image keypoint into a center visual word, and 2) quantizing and recording the spatial and visual clues of multiple neighbor keypoints. To achieve flexible matching and avoid aggregating quantization error, we match two multi-order visual phrases by 1) first matching their center visual words, and 2) then checking the number of their neighbor keypoints showing similar visual and spatial clues, i.e., the number of matched neighbor keypoints. If two of their neighbor keypoints are matched, we call this match a two-order match. Similarly, there are three-order and four-order matches. A typical example of matched multi-order visual phrases is illustrated in Fig. 1, where lines with different colors denote different match orders. It is easy to infer that a higher order match corresponds to more consistent spatial configurations, and thus reasonably reflects a higher confidence of correctness. Note that the worst case of multi-order visual phrase matching is the zero-order match. This case only matches the center visual words, and is thus equivalent to matching the baseline BoWs models. Therefore, the multi-order visual phrase does not sacrifice the repeatability of the BoWs model, but can significantly improve the discriminative power by embedding and estimating multi-order matches.

The multi-order visual phrase can be indexed with inverted indexes for large-scale image search. We use center visual words to build inverted indexes, and record the spatial and visual clues of neighbor keypoints in the inverted indexes for online verification. During online retrieval, multi-order visual phrases in query and database images are matched in a cascade way by 1) comparing the center visual words, 2) matching the visual clues of neighbor keypoints, and 3) matching the spatial clues of neighbor keypoints. Higher order matches contribute more to the image similarity. Note that each step simply compares the quantized features and can discard a large portion of false matches. Therefore, the multi-order visual phrase performs efficiently in large-scale image search.

Existing approaches improve the BoWs model either by compensating for the quantization error or by capturing extra spatial clues among visual words. Few of them consider and study these two issues jointly. For instance, existing visual phrases capture extra spatial clues but significantly magnify the quantization error, because each visual phrase is considered as a whole cell and only visual phrases containing identical visual words can be matched [42], [45]. Therefore, they are commonly used as auxiliary features to the visual word. Differently, the multi-order visual phrase is a novel alternative to the visual word and can be flexibly matched. It shows obvious advantages over existing visual phrases in the aspects of flexibility, efficiency, and repeatability. Our extensive experiments and comparisons with the state of the art clearly verify these advantages.

Another advantage of the multi-order visual phrase is that it can be extracted from ORB for accurate and fast image matching and large-scale image search. The substantial advantages of efficiency and compactness make binary descriptors like ORB [26] and BRIEF [4] well suited to computation-sensitive scenarios like mobile-based panorama generation, mobile image search, etc. However, to the best of our knowledge, there is limited research effort studying how to improve their matching accuracy and utilize these binary descriptors in large-scale image search. Our experiments show that the ORB-based multi-order visual phrase achieves competitive retrieval accuracy and significantly faster speed than traditional BoWs model based image search methods.

The remainder of this paper is organized as follows. Section II introduces related work on local descriptors, image matching, and retrieval. Section III presents the extraction of the multi-order visual phrase (MVP). Section IV discusses the indexing and retrieval algorithms. Section V analyzes experimental results, followed by the conclusions in Section VI. In the following, we also refer to the multi-order visual phrase as MVP for short.

II. RELATED WORK

In this section, we summarize related work on image local descriptors and state-of-the-art approaches utilizing local descriptors in image matching and retrieval.

A. Image Local Descriptor

SIFT is extracted from image keypoints detected with the difference-of-Gaussian (DoG) detector, which detects the scale, orientation, and location of each keypoint.


With the scale and orientation clues, the image patches surrounding keypoints can be normalized to a fixed orientation and size, so scale- and orientation-invariant local descriptors can be extracted afterward. Based on SIFT, other similar floating point descriptors such as SURF [1], PCA-SIFT [12], and the gradient location and orientation histogram (GLOH) [21] have been proposed. According to the results reported in [21], SIFT and one of its extensions, GLOH, generally outperform the other descriptors.

Other than the above-mentioned floating point descriptors, many efforts have been made in recent years to design efficient and compact descriptors as alternatives to SIFT or SURF. Some researchers propose low-bitrate descriptors such as BRIEF [4], BRISK [14], CHoG [5], Edge-SIFT [43], and ORB [26], which are fast both to build and to match.

Each bit of BRIEF [4] is computed from the sign of a simple intensity difference test between a pair of points sampled from the image patch. Despite its clear advantage in speed, BRIEF suffers in terms of reliability and robustness, as it has limited tolerance to image distortions and transformations. The BRISK descriptor [14] first efficiently detects interest points in the scale-space pyramid with a detector inspired by FAST [25] and AGAST [19]. Then, given a set of detected keypoints, the BRISK descriptor is composed as a binary string by concatenating the results of simple brightness comparison tests. The detector adopted by BRISK provides location, scale, and orientation clues for each interest point, so BRISK achieves orientation and scale invariance. The ORB descriptor is built on BRIEF but is extracted with a novel corner point detector, i.e., o-FAST [26], and is more robust to rotation and image noise. The authors demonstrate that ORB is significantly faster than SIFT, while performing as well in many situations [26]. Compared with SIFT, despite their advantage in speed, these compact descriptors show limitations in descriptive power, robustness, or generality.

B. Local Feature Based Image Matching and Retrieval

Local descriptors like SIFT [18], SURF [1], ORB [26], etc., were originally proposed for image matching. As discussed by Lowe in [18], SIFT descriptors can be matched with two criteria: 1) the cosine similarity of matched SIFT descriptors should be larger than a threshold, e.g., 0.83; and 2) the similarity ratio between the closest match and the second-closest match should be larger than a threshold, e.g., 1.3. However, these two criteria ignore the spatial clues among keypoints, which have been proven important for visual matching. Some works improve the accuracy of local feature matching by verifying the spatial consistency among matched local features, i.e., correctly matched local descriptors should satisfy a consistent spatial relationship. As discussed in [11], the matching accuracy of SIFT, SURF, and PCA-SIFT is substantially improved with RANSAC [7], which estimates an affine transformation between two images and rejects matched local descriptors violating this transformation.

Matching raw local descriptors is expensive to compute because each image may contain over one thousand high-dimensional descriptors. Meanwhile, exploiting geometric relationships with full geometric verification like RANSAC [7] improves retrieval precision, but is computationally expensive. Therefore, raw SIFT and RANSAC are not ideal for large-scale applications. An efficient solution is to quantize local descriptors into visual words and generate a BoWs model. It has been illustrated that a single visual word cannot preserve the spatial information in images, which has been proven important for visual matching and recognition [30], [42], [40], [47], [46], [45], [41], [6]. To combine BoWs with spatial information, many works combine multiple visual words with spatial information [30], [42], [40], [47], [46], [45], [41], [6]. This may be achieved, for example, by using feature pursuit algorithms such as AdaBoost [32], as demonstrated by Liu et al. [17]. Visual word correlograms and correlations [27], adapted from the color correlogram, are utilized to model the spatial relationships among visual words for object recognition. In [38], visual words are bundled, and the corresponding image indexing and visual word matching algorithms are proposed for large-scale near-duplicate image retrieval. In [47], the spatial configurations of visual words are recorded in the index and utilized for spatial verification during online retrieval. In a recent work [6], Chu et al. utilize a spatial graph model to estimate the spatial consistency among matched visual words during online retrieval; hence, visual words violating the spatial consistency can be effectively identified and discarded. Visual phrases preserve extra spatial information by involving multiple visual words, and thus generally present more advantages in large-scale image retrieval [8], [42], [45]. For example, the descriptive visual phrase (DVP) is generated in [42] by grouping and selecting two nearby visual words. Similarly, the visual phrases proposed in [45] are extracted by discovering multiple spatially stable visual words. Generally, considering visual words in groups rather than as single visual words yields stronger discriminative power.

Notwithstanding the success of existing visual phrase features, they are still limited in flexibility and repeatability. Currently, each visual phrase is commonly treated as a whole cell, and two of them are matched only if they contain identical single visual words [42], [45]. First, this means only visual phrases containing the same number of visual words can be considered for matching. Second, because of quantization error, visually similar descriptors might be quantized into different visual words. Such quantization error is aggregated in visual word combinations, and makes visual phrase matches more difficult to occur.

Different from existing visual phrase features, the proposed multi-order visual phrase captures rich spatial clues and allows more flexible matching without sacrificing the repeatability of the BoWs model. Thus, it is superior in the aspects of flexibility, quantization error, and efficiency. Our approach also differs from existing spatial verification methods in that it does not compute graph models [6] or need multiple iterations with matrix operations as in [47], and is finished in a cascade manner. Hence, it has the potential to achieve better efficiency. Besides that, our approach does not simply keep or discard a candidate match, but reasonably estimates a confidence of correctness for it. This is helpful to improve the recall rate and decrease the quantization error. In the following section, we introduce how we extract the multi-order visual phrase from SIFT descriptors and binary ORB descriptors.


III. MULTI-ORDER VISUAL PHRASE

The proposed MVP can be extracted from both the SIFT and ORB descriptors. We use SMVP to denote the MVP extracted from SIFT and OMVP to denote the MVP extracted from ORB. In the following, we first introduce how we extract SMVP and then show the extraction steps of OMVP.

A. Multi-Order Visual Phrase From SIFT

Before SMVP extraction, we first detect image keypoints with the DoG detector and extract a SIFT descriptor from each of them [18]. From each keypoint we capture four aspects of clues [18], i.e., the 128-dimensional SIFT descriptor $f$, the location of the keypoint $l$, the scale of the keypoint $s$, and the orientation of the keypoint $o$. We hence denote a DoG keypoint in image $I$ as

$k = (f, l, s, o), \quad k \in \mathcal{K}_I$   (1)

where $\mathcal{K}_I$ is the collection of DoG keypoints in image $I$.

We extract an SMVP from each keypoint with two steps: 1) quantize the SIFT descriptor at the keypoint into a visual word with a pre-trained visual vocabulary, i.e., center visual word extraction, and 2) neighbor keypoint extraction and neighbor visual and spatial clue extraction. Each SMVP thus contains three aspects of clues, i.e., the center visual word and the visual and spatial clues of neighbor keypoints. We define the SMVP extracted at keypoint $k$, i.e., $P_k$, as

$P_k = \left(w, \{(v_i, g_i)\}_{i=1}^{N}\right)$   (2)

where $n_i$ is one of the neighbor keypoints of $k$, $N$ is the total number of neighbor keypoints considered in SMVP computation, and $w$, $v_i$, and $g_i$ are the center visual word and the visual and spatial clues of neighbor keypoint $n_i$, respectively. We proceed to introduce the extraction of these clues in the following parts.

1) Center Visual Word Extraction: We quantize the SIFT descriptor extracted at each keypoint into a center visual word with a pre-trained visual vocabulary tree. A visual vocabulary tree $\mathcal{T}$ is obtained by hierarchical K-means clustering of local descriptors (extracted from a separate dataset), with a branch factor $b$ and depth $h$. This tree typically is deep and contains millions of leaf nodes, in order to achieve fast quantization and high distinctiveness [22].

For a descriptor $f$ extracted at keypoint $k$, we quantize it along $\mathcal{T}$ down to the leaf nodes, i.e., the $h$th level, as

$f \xrightarrow{\mathcal{T}} \left(w^{(1)}, w^{(2)}, \ldots, w^{(h)}\right)$   (3)

where $h$ denotes the maximum level of the vocabulary tree and $w^{(j)}$ is the visual word quantized from $f$ at the $j$th level of the vocabulary tree.

We take $w^{(h)}$ as the center visual word $w$ for keypoint $k$ in (2). It is easy to infer that the generation of the center visual word is identical to the generation of the traditional visual word in [22]. Meanwhile, matching center visual words is identical to matching the traditional visual words in [22]. In the following section, we introduce how we capture neighbor keypoints and extract multi-order spatial clues from them.
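To make the vocabulary-tree quantization concrete, the following sketch (not the authors' released code; the tree layout and variable names are illustrative assumptions) shows how a SIFT descriptor can be propagated from the root to a leaf of a hierarchical K-means tree, yielding the visual word IDs of all levels in one descent, as in (3):

```python
# A minimal sketch of hierarchical vocabulary-tree quantization. The tree is
# assumed to be stored as one centroid array per level, where entry [node]
# holds the child centroids of that node; this layout is illustrative.
import numpy as np

class VocabularyTree:
    def __init__(self, centroids_per_level):
        # centroids_per_level[j] has shape (nodes_at_level_j, branch, 128)
        self.centroids_per_level = centroids_per_level

    def quantize_path(self, descriptor):
        """Descend from the root to a leaf and return the visual word ID
        obtained at every level, cf. (3)."""
        path, node = [], 0
        for level_centroids in self.centroids_per_level:
            children = level_centroids[node]                 # (branch, 128)
            child = int(np.argmin(np.linalg.norm(children - descriptor, axis=1)))
            node = node * children.shape[0] + child          # ID within the next level
            path.append(node)
        return path

# One descent serves both quantizations described in Section III-A: the
# leaf-level ID is the center visual word, and a shallower-level ID is reused
# later as the neighbor visual word, so no descriptor is quantized twice.
# path = VocabularyTree(levels).quantize_path(sift_descriptor)
# center_word, level2_word = path[-1], path[1]   # level indices start at 1
```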

Fig. 2. Illustration of the neighbor spatial and visual clues extraction.

2) Neighbor Feature Extraction: To improve the discriminative power of SMVP, we further extract the spatial and visual clues of the neighbor keypoints of each center visual word. Because correct matches show similar neighbor spatial and visual configurations [30], [47], [38], [7], we can compare the neighbor spatial and visual clues of two center visual words to check whether they are correctly matched.

As shown in Fig. 2, for a keypoint $k$, we consider its neighbor region to extract neighbor keypoints. To achieve scale invariance, we compute the radius $r$ of the region based on the scale $s$ of $k$, i.e.,

$r = \alpha \cdot s$   (4)

where the parameter $\alpha$ controls the size of the region. A larger $\alpha$ involves more keypoints and captures richer spatial information. However, a too large $\alpha$ means a large search space and degrades the efficiency. We set $\alpha$ as 12, which produces a good tradeoff in our experiments.

Within the neighbor region of $k$, we select at most $N$ keypoints with the closest scales to $k$ for spatial and visual clue extraction. DoG keypoints with similar scales commonly locate on nearby scale spaces and show similar responses [18]; therefore, the co-occurrence relationships between them are more stable. For each selected keypoint $n_i$, we extract its visual feature $v_i$ and two aspects of spatial features, i.e.,

$g_i = \left(g_i^{o}, g_i^{d}\right)$   (5)

where $g_i^{o}$ and $g_i^{d}$ are the orientation and distance relationships between keypoints $k$ and $n_i$.

To compute the visual feature $v_i$ of neighbor keypoint $n_i$, we also quantize its SIFT descriptor $f_i$ into visual words with the visual vocabulary $\mathcal{T}$. Different from the center visual word, $v_i$ is generated with a smaller codebook to decrease the quantization error. Specifically, we quantize $f_i$ with

$v_i = w_i^{(m)}, \quad m < h$   (6)

According to (6), we quantize $f_i$ along $\mathcal{T}$ down to the $m$th level of the vocabulary tree, which contains $b^m$ visual words. The $m$th level produces fewer division hyperplanes in the feature space than the $h$th level. Therefore, fewer feature points would be close to the division planes in (6), i.e., the $m$th level produces smaller quantization error than the $h$th level of the vocabulary tree. As illustrated in Fig. 3, (3) naturally computes (6) because it has to go through the $m$th level. Thus, we do not need to quantize each descriptor twice. Compared with the traditional BoWs model, no extra computational overhead is introduced in feature quantization.


Fig. 3. Illustration of the quantization of center local descriptors and neighbor local descriptors with the vocabulary tree.

Fig. 4. Illustration of matched SMVPs with different match orders. The red lines denote the matched center visual words; the blue lines denote the matched neighbor local descriptors showing similar visual and spatial clues.

To compute $g_i$, we quantize the orientation difference between $k$ and $n_i$ into $Q$ intervals, i.e.,

$g_i^{o} = \left\lfloor \dfrac{\Delta o_i}{2\pi} \cdot Q \right\rfloor$   (7)

where $\Delta o_i \in [0, 2\pi)$ is the difference of orientation between $k$ and $n_i$ (Fig. 2). $g_i^{d}$ reflects the distance between $k$ and $n_i$. It is computed as

$g_i^{d} = \left\lfloor \dfrac{\Delta l_i}{r} \cdot Q \right\rfloor$   (8)

where $\Delta l_i$ is the distance between $k$ and $n_i$ (Fig. 2) and $r$ is the radius of the neighbor region in (4).

Different from existing visual phrases, each of which is treated as a whole cell, two SMVPs are matched by first matching their center visual words and then verifying their neighbor spatial and visual clues. For two SMVPs with identical center visual words, if $c$ of their neighbor keypoints show similar spatial and visual features, we call this match a $c$-order match. Matched SMVPs with different match orders are illustrated in Fig. 4. Obviously, a higher order match needs to satisfy a stricter spatial constraint, making it more difficult to occur. Therefore, it is not necessary to set $N$ in (2) to a too large value, which would also consume more memory and storage space in image retrieval. We hence set $N$ as 4, which at most produces four-order matches.

After SMVP extraction, we build inverted indexes according to the center visual word, and record the neighbor spatial and visual clues in the index file for online verification (to be discussed in Section IV). To generate a compact index, the recorded neighbor visual and spatial clues should be as compact as possible. We set the parameter $Q$ in (7) and (8) as 16; therefore, $g_i^{o}$ and $g_i^{d}$ can each be preserved with 4 bits. As for the vocabulary tree $\mathcal{T}$, we set its parameters to produce about 1 million leaf nodes. For $v_i$, we set $m$ in (6) as 2, corresponding to 256 visual words; therefore, $v_i$ can be preserved with 8 bits, which represent 256 visual word IDs. For each neighbor keypoint, we only need $4 + 4 + 8 = 16$ bits to preserve its spatial and visual clues.
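A compact sketch of the neighbor clue quantization and packing described above is given below. It is a hedged illustration, not the authors' implementation: the uniform binning follows (7) and (8) as reconstructed here, and normalizing the distance by the region radius $r$ is our assumption for scale invariance.

```python
# Quantize and pack one neighbor's clues: an 8-bit neighbor visual word from
# (6) plus 4-bit orientation and distance bins (Q = 16), i.e., 16 bits total.
import math

Q = 16  # number of quantization intervals for orientation and distance

def pack_neighbor_clue(neighbor_word, delta_orientation, distance, radius):
    """neighbor_word: level-2 visual word ID (0..255);
    delta_orientation: orientation difference in [0, 2*pi);
    distance/radius: distance to the center keypoint and region radius r."""
    g_o = min(int(delta_orientation / (2 * math.pi) * Q), Q - 1)  # cf. (7)
    g_d = min(int(distance / radius * Q), Q - 1)                  # cf. (8), assumed normalization
    return ((neighbor_word & 0xFF) << 8) | (g_o << 4) | g_d

def unpack_neighbor_clue(packed):
    """Return (visual word, orientation bin, distance bin)."""
    return packed >> 8, (packed >> 4) & 0xF, packed & 0xF
```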

B. Multi-Order Visual Phrase From ORB

The extraction of OMVP is similar to the extraction of SMVP. Differently, we do not quantize the ORB descriptor with a vocabulary tree, because the binary ORB allows fast matching using the Hamming distance. From each ORB feature detected by the o-FAST detector, we also extract four aspects of information: the 256-bit ORB descriptor $f$, the location of the keypoint $l$, the scale of the keypoint $s$, and the orientation of the keypoint $o$ [26]. We hence denote the ORB feature extracted at an o-FAST keypoint in image $I$ as

$k = (f, l, s, o), \quad k \in \mathcal{K}_I$   (9)

where $\mathcal{K}_I$ is the collection of o-FAST keypoints in image $I$.

Then, from each keypoint, we extract the OMVP with two steps: 1) extract the ORB descriptor at the keypoint as the center ORB descriptor, and 2) neighbor keypoint extraction and neighbor visual and spatial clue extraction. Each OMVP also contains three aspects of clues, i.e., the center ORB descriptor and the ORB descriptors and spatial clues of neighbor keypoints. We define the OMVP extracted at keypoint $k$, i.e., $P_k$, as

$P_k = \left(f, \{(f_i, g_i)\}_{i=1}^{N}\right)$   (10)

where $n_i$ is one of the neighbor keypoints of $k$, $N$ is the total number of neighbor keypoints considered in OMVP computation, and $f$, $f_i$, and $g_i$ are the center ORB descriptor, the neighbor ORB descriptors, and the spatial clues of neighbor keypoints, respectively. According to (10), we directly preserve the ORB descriptors of the center keypoint and neighbor keypoints in the OMVP, because ORB allows fast matching with the Hamming distance.

In (10), we only need to compute the spatial clues, i.e., $g_i$. The definition of $g_i$ in OMVP is identical to the one in SMVP in (5); hence it is also computed with (7) and (8), respectively, i.e., we quantize and preserve the orientation and distance relationships between the center keypoint and the neighbor keypoints as $g_i^{o}$ and $g_i^{d}$, respectively. We set the parameter $Q$ in (7) and (8) as 16; therefore, $g_i$ can be preserved with 8 bits. In the following section, we show how we use SMVP and OMVP in image matching and image retrieval.
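Since OMVP keeps raw ORB bits for both the center and the neighbor clues, matching reduces to Hamming distances. The helper below is a small sketch, assuming the 256-bit descriptors are stored as 32-byte uint8 arrays, as common ORB extractors produce:

```python
# Hamming distance between two 256-bit ORB descriptors stored as uint8 arrays.
import numpy as np

def hamming(orb_a, orb_b):
    """orb_a, orb_b: numpy arrays of dtype uint8 and length 32 (256 bits)."""
    return int(np.unpackbits(np.bitwise_xor(orb_a, orb_b)).sum())

# Section IV accepts a candidate OMVP match when the center distance is at
# most 40 and verifies a neighbor when its distance is at most 50.
```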

IV. APPLICATIONS: IMAGE MATCHING AND RETRIEVAL

A. Image Matching

In this part, we show how we match the extracted SMVPs and OMVPs. We first show the matching process of SMVPs, and then proceed to introduce the matching of OMVPs.

From each DoG keypoint, we extract one SMVP. Suppose $\mathcal{K}_{I_1}$ and $\mathcal{K}_{I_2}$ are two collections of DoG keypoints extracted from images $I_1$ and $I_2$, respectively. Then, we match the SMVPs between $\mathcal{K}_{I_1}$ and $\mathcal{K}_{I_2}$.


For two SMVPs $P_{k_1}$ and $P_{k_2}$, we first check whether their center visual words are identical, i.e., $w_1 = w_2$. We call SMVPs with identical center visual words candidate matches. Suppose two SMVPs $P_{k_1}$ and $P_{k_2}$ compose a candidate match. To check whether this candidate match is correct, we see whether the neighbor keypoints of $k_1$ and $k_2$ are similar enough. We hence use the neighbor spatial and visual clues defined in (2) to conduct this check. We first verify the neighbor visual words, and then check the orientation and distance relationships. Denote $n_i$ and $n_j$ as two neighbor keypoints of $k_1$ and $k_2$, respectively. The verifications are computed in a cascade way with three steps:

$v_i = v_j$   (11)

$\min\left(\left|g_i^{o} - g_j^{o}\right|,\; Q - \left|g_i^{o} - g_j^{o}\right|\right) \le t_1$   (12)

$\left|g_i^{d} - g_j^{d}\right| \le t_2$   (13)

where the two parameters $t_1$ and $t_2$ control the strictness of the spatial verification. Note that (12) normalizes the orientation relationship difference to within $[0, Q/2]$.

The maximum number of neighbor keypoints is 4; therefore, to verify two SMVPs, we conduct at most $4 \times 4 = 16$ verifications. However, this worst case rarely occurs because the center visual word checking filters a large portion of false matches. Besides that, the spatial and visual verifications are also very fast due to the cascade structure. Hence, our verification strategy is similar to the spatial verification in Video Google [30], but is far more efficient.

The matching process of OMVP is almost identical to that of SMVP because they share similar definitions and representations. The difference is that we use the Hamming distance to match the center ORB descriptors and neighbor ORB descriptors, respectively. Suppose $P_{k_1}$ and $P_{k_2}$ are two OMVPs from images $I_1$ and $I_2$, respectively. They compose a candidate match if

$\mathrm{Ham}(f_1, f_2) \le T_c$   (14)

where $T_c$ controls the threshold of center visual descriptor matching. We set it as 40 for the 256-bit ORB descriptor.

Suppose $P_{k_1}$ and $P_{k_2}$ compose a candidate match; we then also verify the spatial and visual clues of their neighbor keypoints. We use (12) and (13) to check their spatial relationships. Suppose $n_i$ and $n_j$ are two neighbor interest points of $k_1$ and $k_2$, respectively; we use the Hamming distance to check their visual consistency, i.e.,

$\mathrm{Ham}(f_i, f_j) \le T_n$   (15)

where $T_n$ controls the threshold of neighbor visual descriptor matching. We set it as 50 for the 256-bit ORB descriptor.

It is easy to observe that the matching processes of SMVP and OMVP are almost identical; the difference is mainly in the verification of the visual clues. Since we set the maximum number of neighbor keypoints in the multi-order visual phrase, i.e., $N$, as 4, at most 4 of the neighbors of two SMVPs or two OMVPs can pass the cascade spatial and visual verifications. Because high order matches are difficult to occur, we avoid setting strict $t_1$ and $t_2$, to assure the recall rate of high order matches.

Fig. 5. Examples of matched SMVPs with different match orders. Red: one-order matches. Green: two-order matches. Blue: three-order matches. White: four-order matches.

Fig. 6. Illustration of the index structure based on SMVP.

We experimentally set them as 2 and 3, respectively, which produces reliable matching with different match orders. An example of matched SMVPs with different match orders is illustrated in Fig. 5. It is clear that high order matches are more reliable and sparser than low order matches. We will test the matching performance of SMVP and OMVP in Section V.
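The cascade verification that turns a candidate match into a match order can be sketched as follows. This is an illustration under our reading of (11)-(13), not the released implementation; in particular, the greedy one-to-one pairing of neighbors is an assumption.

```python
# Assign a match order (0..4) to a candidate SMVP match. Each neighbor is a
# (visual_word, g_o, g_d) tuple; Q = 16, and t1 = 2, t2 = 3 as in Section IV-A.
Q, T1, T2 = 16, 2, 3

def circular_diff(a, b, q=Q):
    """Difference of two quantized orientations, folded into [0, q/2]."""
    d = abs(a - b)
    return min(d, q - d)

def match_order(neighbors_q, neighbors_d):
    """Count how many query neighbors find a consistent, unused database
    neighbor; the count is the match order (0 = zero-order match)."""
    used, order = set(), 0
    for v_q, go_q, gd_q in neighbors_q:
        for j, (v_d, go_d, gd_d) in enumerate(neighbors_d):
            if j in used:
                continue
            if (v_q == v_d                               # (11) neighbor visual words agree
                    and circular_diff(go_q, go_d) <= T1  # (12) orientation consistency
                    and abs(gd_q - gd_d) <= T2):         # (13) distance consistency
                used.add(j)
                order += 1
                break
    return order
```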

B. Image Indexing and Retrieval

After extracting a collection of SMVPs from a database image $D$, we index them in the inverted indexes illustrated in Fig. 6. Specifically, for an SMVP $P_k = (w, \{(v_i, g_i)\})$, we record the image ID of $D$ and its neighbor spatial and visual clues, i.e., $\{(v_i, g_i)\}$, in the inverted index corresponding to its center visual word $w$. This is slightly different from the traditional approach, which records the image ID and the term frequency (TF) of the visual word in the inverted index [22]. More details about inverted file indexing and term frequency-inverse document frequency (TF-IDF) weighting in image similarity computation can be found in [22].

Based on the inverted indexes in Fig. 6, we propose our retrieval strategy. We also use IDF to weight the image similarity computation as in [22]. Instead of using TF, we use the match order computed by spatial and visual verification to measure the importance of matched SMVPs to the image similarity, i.e., high order matches between the query $Q$ and a database image $D$ are more important for image similarity computation. Accordingly, the image similarity is computed as

(16)

where $k_q$ and $k_d$ are two DoG keypoints in $Q$ and $D$ whose center visual words are identical, $O(k_q, k_d)$ returns the match order of $k_q$ and $k_d$ computed by the spatial and visual verification, and $\lambda$ controls the importance of high order matches in image similarity computation. It will be discussed in the experiments.


Fig. 7. Illustration of the index structure based on OMVP.

We use a similar index structure to index the OMVP. As illustrated in Fig. 7, we use the center ORB descriptor to build the inverted indexes. The 256-bit ORB descriptor represents $2^{256}$ integers, which is too large to be indexable. Therefore, we use the first 24 bits of ORB for indexing, which represent $2^{24}$ inverted indexes. We use the first 24 bits because the binary bins in ORB are selected with a greedy search strategy, and the first selected bins show larger discriminative ability and smaller correlation [26]. Similar to SMVP, we also record the neighbor visual and spatial clues of the OMVP in the inverted indexes. To make the index file as compact as possible, we only record the first 8 bits of the neighbor ORB descriptors as visual clues. Hence, the recorded spatial and visual clues for each OMVP take 16 bits per neighbor keypoint, i.e., 64 bits in total, which is identical to the case of SMVP. Therefore, we further simplify and compress the OMVP for indexing and retrieval, making it different from the OMVP used for image matching.

Based on the above-mentioned indexing structure of OMVP, the similarity between a database image $D$ and a query image $Q$ can be computed as

(17)

where $f^{(24)}$ denotes the first 24 bits of the center ORB descriptor, and $k_q$ and $k_d$ are two o-FAST keypoints in $Q$ and $D$ whose ORB features are matched, i.e., their Hamming distance is no larger than $T_c$. $O(k_q, k_d)$ returns the match order of $k_q$ and $k_d$ computed by the spatial and visual verification in (12), (13), and (15). In retrieval, we only record the first 8 bits of the neighbor ORB descriptors as the visual clues; we hence set the visual verification parameter $T_n$ in (15) as 2. $T_c$ controls the strictness of center ORB feature matching, and it largely affects the precision and recall rate of OMVP-based image search. It will be discussed in the experiments.
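The indexing and scoring pipeline can be sketched as follows. It is a simplified illustration: the postings layout mirrors Fig. 6, the IDF weighting follows [22], and the order-dependent weight of (16) is left as a pluggable function because its exact form is not reproduced here; `match_order` refers to the cascade check sketched in Section IV-A.

```python
# A hedged sketch of an SMVP inverted index with order-weighted scoring.
import math
from collections import defaultdict

class SMVPIndex:
    def __init__(self):
        self.postings = defaultdict(list)   # center word -> [(image_id, neighbor clues)]
        self.doc_freq = defaultdict(int)    # center word -> number of images containing it

    def add_image(self, image_id, smvps):
        """smvps: list of (center_word, neighbor_clues) extracted from one image."""
        seen = set()
        for center_word, neighbors in smvps:
            self.postings[center_word].append((image_id, neighbors))
            if center_word not in seen:
                self.doc_freq[center_word] += 1
                seen.add(center_word)

    def query(self, smvps, num_images, order_weight):
        """order_weight maps a match order (0..4) to its contribution weight."""
        scores = defaultdict(float)
        for center_word, q_neighbors in smvps:
            df = self.doc_freq.get(center_word, 0)
            if df == 0:
                continue
            idf = math.log(num_images / df)
            for image_id, d_neighbors in self.postings[center_word]:
                order = match_order(q_neighbors, d_neighbors)  # cascade verification
                scores[image_id] += idf * order_weight(order)
        return sorted(scores.items(), key=lambda item: -item[1])

# Example: an order weight that grows with the match order, e.g.,
# order_weight = lambda o: (1 + 0.4) ** o, is one plausible choice; the
# paper's exact weighting should be taken from (16).
```

For OMVP, the same structure applies with the first 24 bits of the center ORB descriptor used as the index key and the first 8 bits of each neighbor ORB descriptor used as its visual clue.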

V. EXPERIMENTS

A. Experimental Setup

We evaluate the proposed method in an image matching task and three image retrieval tasks, i.e., object search, landmark search, and large-scale image search. In UKbench, there commonly exist obvious transformations among the four images of the same object, such as viewpoint changes, illumination changes, etc. Therefore, UKbench is suitable for testing the robustness of descriptors in image matching. We thus use UKbench as the test set for image matching and object search. For landmark search, we use the Oxford5K dataset, which contains 5063 annotated landmark images. As for large-scale similar image search, we mix the two datasets with 1 million images collected from Flickr and conduct the retrieval with the original queries in the two datasets. We adopt the metrics in the original papers of the two datasets to evaluate the retrieval performance: the recall rate for the top-4 candidates (referred to as the N-S score) for UKbench, and the mAP (mean average precision) for Oxford5K. All experiments are conducted on a PC with a 4-core 3.4 GHz processor and 16 GB memory.

B. Image Match

To test the image matching performance of SMVP and OMVP, we first collect all the correctly matched DoG and o-FAST keypoints as a ground truth set. To collect this ground truth set, we first match raw SIFT descriptors and raw ORB descriptors with relatively loose constraints. Then, we use RANSAC [7] to filter the false matches. Specifically, among the four images of each object in UKbench, we first match the SIFT and ORB descriptors between the first image and the other three images with a cosine similarity threshold of 0.83 and a Hamming distance threshold of 40, respectively. Then, we apply RANSAC to the initial matching results. All matches passing RANSAC are recorded as ground truth. In the subsequent experiments, we also match the first image with the other three images using different algorithms, and then compute the precision and recall rates based on the collected ground truth set. Note that, with the matching criteria suggested by Lowe [18], the raw SIFT descriptor produces more reliable matching than the quantized BoWs model. However, SIFT needs more than 1000 ms to match two images, which is too slow for real-world applications.
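The SIFT part of this ground-truth protocol can be reproduced with standard tools; the sketch below uses OpenCV (cv2.SIFT_create is available in opencv-python 4.4+) and an affine RANSAC model, which is our assumption since the paper does not detail the transformation or reprojection threshold.

```python
# Sketch of the ground-truth generation: loosely match raw SIFT by cosine
# similarity (threshold 0.83), then keep only matches that survive RANSAC.
import cv2
import numpy as np

def groundtruth_matches(img1, img2, cos_thresh=0.83, reproj_thresh=3.0):
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Cosine similarity between L2-normalized SIFT descriptors.
    d1 = des1 / np.linalg.norm(des1, axis=1, keepdims=True)
    d2 = des2 / np.linalg.norm(des2, axis=1, keepdims=True)
    sims = d1 @ d2.T
    nn = sims.argmax(axis=1)
    keep = np.where(sims[np.arange(len(d1)), nn] >= cos_thresh)[0]

    src = np.float32([kp1[i].pt for i in keep])
    dst = np.float32([kp2[nn[i]].pt for i in keep])
    if len(src) < 3:
        return []

    # RANSAC estimates an affine transformation and rejects violating matches.
    _, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                      ransacReprojThreshold=reproj_thresh)
    if inliers is None:
        return []
    return [(tuple(s), tuple(d)) for s, d, ok in zip(src, dst, inliers.ravel()) if ok]
```

The ORB ground truth follows the same pattern, with a Hamming distance threshold of 40 replacing the cosine similarity test.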

Our spatial and visual verification identifies matched SMVPs and OMVPs with different match orders. In Fig. 8, we present the precision and recall rates of matched SMVPs and OMVPs for 0-4 order matches. As shown in the figure, high order matches are consistently more accurate than low order ones. Meanwhile, the zero-order matches, i.e., candidate matches that fail the spatial and visual verification, present both poor precision and poor recall rates. This result reasonably implies that most of the correct candidate matches can pass the cascade verification, while the false matches are effectively identified and rejected.

Besides that, in Fig. 8, the high order (e.g., four-order) matches show lower recall rates than the low order (e.g., one-order) matches. This also validates that it is not necessary to set $N$ in (2) and (10) to larger values, because higher order matches are more difficult to occur. From Fig. 8, we can also observe that OMVP performs better than SMVP in image matching. This is mainly because we preserve raw ORB descriptors in OMVP for image matching but preserve quantized SIFT descriptors in SMVP.

To intuitively show the matching performance of the multi-order visual phrase, we compare SMVPs and OMVPs with 1-4 order matches against raw SIFT, the BoWs model, and raw ORB, respectively. The comparisons of precision, recall, and efficiency are illustrated in Table I. Note that the compared time includes feature extraction time.


Fig. 8. (a) Matching performance of SMVP with different match orders. (b) Matching performance of OMVP with different match orders.

TABLE I COMPARISON OF MATCHING PRECISION, RECALL, AND EFFICIENCY

As illustrated in Table I, raw SIFT and raw ORB are utilized for ground truth set generation, hence their recall rates are 100%. By matching the identical visual words between two images, the precision of the BoWs model achieves 0.761. Considering extra multi-order spatial clues, SMVP and OMVP outperform their baselines, i.e., the BoWs model and raw ORB, by large margins, respectively. This clearly validates the importance of spatial clues in image matching. It is necessary to point out that SMVP and OMVP are also fast to match: SMVP is about twice as fast as raw SIFT and achieves similar efficiency to the BoWs model, while OMVP is about one order of magnitude faster than raw SIFT. Compared with raw ORB, the matching time overhead of OMVP is almost unnoticeable. This comparison clearly demonstrates the validity of our proposed multi-order visual phrase in image matching.

From this experiment, we observe that OMVP achieves the best trade-off between accuracy and efficiency in image matching. We also tested it by matching partial-duplicate images. Some examples of matched partial-duplicate images based on OMVP are illustrated in Fig. 9.

C. Image Search

In this experiment, we implement the BoWs model based retrieval method as a baseline and compare SMVP and OMVP with state-of-the-art retrieval approaches. Note that our index structure in Fig. 7 and our retrieval strategy in (17) also allow using 24-bit ORB without spatial clues for image retrieval; therefore, we also use 24-bit ORB as a baseline.

Both SMVP and OMVP based retrieval involve one parameter in (16) and (17), i.e., $\lambda$, which controls the importance of high order matches. It is easy to infer that a larger $\lambda$ puts more weight on higher order matches. OMVP based retrieval is also related to another parameter, i.e., $T_c$ in (17). In the following, we first make comparisons with the baselines and test the influence of different parameter settings.

Fig. 9. Examples of the matched partial-duplicate images based on OMVP.

The influence of $\lambda$ and the comparison between SMVP and the baseline BoWs model are illustrated in Fig. 10. It can be observed that a proper $\lambda$ significantly improves the retrieval performance on both datasets. High order matches satisfy stricter spatial constraints and hence are more accurate than low order matches; therefore, it is reasonable to emphasize high order matches in image similarity computation. However, overemphasizing the high order matches by setting a too large $\lambda$ also degrades the performance. This might be because high order matches are more difficult to occur than lower order ones due to the stricter matching constraint; they are therefore sparse and cannot perform well alone. We observe that setting $\lambda$ as 0.4 achieves the best performance on both datasets for SMVP. We hence set $\lambda$ as 0.4 for SMVP in the large-scale image retrieval experiment in Section V-D. Examples of matched SMVPs between query and database images are illustrated in Fig. 11.

To show the validity of SMVP, we further compare and show the individual retrieval performance of different orders of matches in Fig. 12. It is easy to observe that the zero-order match performs worse than the one-order match and the two-order match. This validates that many false matches are recognized by SMVP as zero-order matches. It can also be observed that the one-order match outperforms the baseline BoWs model, even when it is used in retrieval as a single feature.


Fig. 10. Influence of $\lambda$ and comparison between SMVP and the baseline BoWs model.

Fig. 11. Examples of matched SMVPs between query and database images.

Fig. 12. Retrieval performance of SMVP with different match orders.

In addition, the four-order match shows the worst performance when used alone in retrieval. This is reasonable, because high order matches produce high precision but a low recall rate. This experiment clearly shows that flexibly estimating different match orders is important for a visual phrase feature. Fig. 12 also explains why setting a too large $\lambda$ degrades the retrieval performance.

The influence of $\lambda$ and $T_c$ on OMVP and the comparison between OMVP and 24-bit ORB are illustrated in Fig. 13. Similar to the observation for SMVP, a larger $\lambda$ significantly improves the retrieval performance, but setting a too large $\lambda$ degrades the performance. It can also be observed that $T_c$ influences the retrieval performance. According to (17), a larger $T_c$ visits more inverted indexes and hence improves the recall rate. However, a too large $T_c$ may introduce many noisy candidate matches that are hard to filter by spatial and visual verification. We observe that setting $T_c$ as 3 achieves the best performance for both OMVP and ORB. Meanwhile, setting $\lambda$ as 0.4 and 0.5 for OMVP achieves the best performance on UKbench and Oxford5K, respectively. We also use these parameter settings for OMVP in the large-scale image retrieval experiment in Section V-D.

Fig. 13. Influence of $\lambda$ and $T_c$ and comparison between OMVP and the baseline 24-bit ORB.

TABLE II RETRIEVAL PERFORMANCE COMPARISON WITH STATE-OF-THE-ART APPROACHES

TABLE III EXPERIMENTAL RESULTS OF LARGE-SCALE IMAGE SEARCH ON UKBENCH PLUS 1 MILLION FLICKR IMAGES

The comparison of SMVP and OMVP with recent retrieval methods (without re-ranking) is shown in Table II. The comparison demonstrates that our approach shows competitive performance. Note that the extraction of SMVPs and OMVPs does not introduce noticeable computation. The online retrieval can also be finished efficiently by verifying the neighbor keypoints in a cascade way, i.e., each step rejects a large portion of false matches. Therefore, SMVP shows similar efficiency to the BoWs model, and is thus more efficient than some of the state-of-the-art methods, which either need to fuse the retrieval results of multiple features [37], or need complicated spatial clue computation [40] and online spatial verification [45]. A detailed comparison of retrieval efficiency is made in the large-scale retrieval experiment.

D. Large-Scale Image Retrieval

In the large-scale image retrieval experiment, we use 1 million Flickr images as distractors and conduct the retrieval with the original queries in UKbench and Oxford5K. We compare with the BoWs model, 24-bit ORB, Hamming embedding (HE) [9], and the descriptive visual phrase (DVP) [42] in terms of retrieval accuracy, efficiency, and memory consumption. Experimental results are summarized in Tables III and IV.

In Tables III and IV, SMVP shows significantly better performance than the BoWs model on both datasets. SMVP also outperforms HE [9] and DVP [42]. This is consistent with our observations in Section V-C. We also observe that although SMVP considers multiple neighbor keypoints and needs online verification in image similarity computation, it shows similar retrieval efficiency to the BoWs model and is faster than Hamming embedding and DVP. This proves the high efficiency of SMVP extraction and verification.


TABLE IV EXPERIMENTAL RESULTS OF LARGE-SCALE IMAGE SEARCH ON OXFORD5K PLUS 1 MILLION FLICKR IMAGES

It is also obvious that OMVP outperforms the BoWs model by a large margin, even though it only uses the first 24 bits of the raw ORB descriptor. Moreover, the retrieval speed of OMVP is about four times faster than the BoWs model and five times faster than Hamming embedding and DVP. This clearly demonstrates the advantage of binary descriptors in terms of efficiency. Generally, we can conclude that OMVP achieves competitive retrieval performance with significantly faster speed than existing retrieval approaches.

In our setting, for each SMVP and OMVP, we record 64 bits of neighbor spatial and visual clues. This is larger than the 32-bit TF (term frequency) stored for each visual word in classic inverted file indexes [22]. Therefore, SMVP and OMVP consume more memory in retrieval than the baseline BoWs model. However, SMVP and OMVP still outperform several recent approaches such as Hamming embedding [8], the descriptive visual phrase [42], the bundled feature [38], etc., in terms of memory consumption. Moreover, for applications in memory-sensitive scenarios, we can flexibly set $N$ in (2) and (10) as 2 or 1, which still captures more spatial clues than a single visual word but corresponds to the same or even better memory performance than the baseline BoWs model.

Generally, from this large-scale experiment, we can conclude that our proposed multi-order visual phrase shows significantly better accuracy than the baseline BoWs model and outperforms several recent approaches in the aspects of accuracy, memory consumption, and efficiency. Moreover, the memory consumption of the multi-order visual phrase can be flexibly controlled and adjusted for different applications.

E. Discussions and Future Work

In this section, we test the performance of SMVP and OMVPin image matching and image search tasks. Experimental resultsshow that SMVP and OMVP perform accurately and efficientlyin these tasks and outperform several recent approaches. Theseexperiments clearly validate the idea of multi-order visualphrase, i.e., combining more neighbor keypoints, flexibly,and efficiently matching neighbor keypoints to estimate thematching confidence, and utilizing matching confidence toweight the image similarity.Besides summarizing the advantages of our approach, it is

Besides summarizing the advantages of our approach, it is necessary to analyze its potential issues and possible solutions in future work. It should be noted that OMVP outperforms SMVP in image matching, but SMVP performs better than OMVP in image retrieval tasks. This is mainly because we use the 256-bit ORB to construct OMVP for image matching, whereas, to make OMVP indexable in inverted indexes, in image retrieval we use only the first 24 bits of the 256-bit ORB to construct OMVP.

Hence, it is reasonable to observe the performance drop of OMVP in image retrieval. This observation also raises an interesting research issue valuable for further study: learning an ultra-compact, e.g., 24-bit, binary descriptor with high discriminative power. A 24-bit descriptor corresponds to 2^24 unique IDs, hence it could be directly utilized as a visual word in large-scale image search without visual codebook training and feature quantization. Possible solutions include 1) using supervised hashing or dimensionality reduction strategies to compress the 256-bit ORB into 24 bits without largely degrading its discriminative power, or 2) designing a more discriminative raw binary descriptor and then selecting the most informative 24 bits from it. This problem will be further studied in our future work.
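
As an illustration of why no codebook is needed in this setting, the following Python sketch maps a 256-bit ORB descriptor to a 24-bit visual word ID by keeping only its first three bytes; the byte ordering and the random stand-in descriptor are assumptions for illustration.

import numpy as np

def orb_to_word_id(descriptor: np.ndarray) -> int:
    # Keep the first 24 bits (three bytes) of a 256-bit ORB descriptor
    # and interpret them as an integer visual word ID in [0, 2**24).
    b0, b1, b2 = int(descriptor[0]), int(descriptor[1]), int(descriptor[2])
    return (b0 << 16) | (b1 << 8) | b2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    desc = rng.integers(0, 256, size=32, dtype=np.uint8)  # stand-in ORB descriptor
    word_id = orb_to_word_id(desc)
    assert 0 <= word_id < 2**24
    print(word_id)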

Each multi-order visual phrase contains one center visual word and multiple neighbor keypoints. Besides the spatial clues in (7) and (8), many other spatial clues, such as the relative position and the standard deviations of scale and orientation in the local neighborhood, could also be calculated. Hence, another valuable topic is to test the validity of different spatial clues and only preserve the most discriminative and complementary ones for multi-order visual phrase generation. This study also has the potential to further decrease the size of the multi-order visual phrase.
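
As a toy illustration of such candidate statistics (not the clues actually used in (7) and (8)), the following Python sketch computes the mean relative position and the standard deviations of scale and orientation over hypothetical neighbor keypoints.

import math

def neighborhood_spatial_clues(center, neighbors):
    # center and each neighbor are hypothetical (x, y, scale, orientation) tuples;
    # which of these statistics is worth keeping is the open question raised above.
    cx, cy, _, _ = center
    # Relative positions of neighbors with respect to the center keypoint.
    rel = [(x - cx, y - cy) for x, y, _, _ in neighbors]
    mean_dx = sum(dx for dx, _ in rel) / len(rel)
    mean_dy = sum(dy for _, dy in rel) / len(rel)
    # Standard deviations of neighbor scales and orientations.
    scales = [s for _, _, s, _ in neighbors]
    orients = [o for _, _, _, o in neighbors]
    std = lambda v: math.sqrt(sum((x - sum(v) / len(v)) ** 2 for x in v) / len(v))
    return {"mean_rel_pos": (mean_dx, mean_dy),
            "scale_std": std(scales),
            "orientation_std": std(orients)}

if __name__ == "__main__":
    center = (100.0, 80.0, 2.0, 0.5)
    neighbors = [(110.0, 85.0, 2.2, 0.6), (95.0, 70.0, 1.8, 0.4), (105.0, 90.0, 2.1, 0.7)]
    print(neighborhood_spatial_clues(center, neighbors))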

VI. CONCLUSION

In this paper, we propose the multi-order visual phrase for image matching and large-scale image search. The multi-order visual phrase contains three aspects of information, i.e., the visual clue of the center keypoint and the visual and spatial clues of the neighbor keypoints. It effectively captures rich spatial clues and overcomes the issues in existing visual phrase features, i.e., low flexibility and low repeatability. Moreover, the multi-order visual phrase can be extracted from both floating point local descriptors like SIFT, i.e., SMVP, and binary descriptors like ORB, i.e., OMVP. Extensive experiments validate that SMVP and OMVP perform accurately and efficiently in the image matching task.

In image retrieval tasks, SMVPs are indexed with inverted indexes that also record the neighbor spatial and visual clues for online verification. During online retrieval, we match two SMVPs flexibly by 1) first matching their center visual words, and 2) then verifying the neighbor spatial and visual clues to estimate the matching correctness. Hence, SMVP differs from existing visual phrases in that it allows more flexible matching and is more robust to quantization error. Comparisons with the BoWs model and recent retrieval approaches demonstrate the competitive accuracy and high efficiency of SMVP.
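
The following Python sketch illustrates this two-step matching scheme. The neighbor-word overlap ratio used here as the match confidence is a simplified stand-in for the spatial and visual consistency check defined earlier in the paper, and all identifiers are hypothetical.

from collections import defaultdict

def match_confidence(neighbors_a, neighbors_b):
    # Stand-in confidence: fraction of shared neighbor visual words.
    # The actual consistency check also uses spatial clues; this overlap
    # ratio is only an illustrative placeholder.
    shared = len(set(neighbors_a) & set(neighbors_b))
    return shared / max(len(neighbors_a), len(neighbors_b), 1)

def image_similarity(query_phrases, db_phrases):
    # query_phrases / db_phrases: lists of (center_word, neighbor_word_set).
    # Step 1: bucket database phrases by their center visual word.
    buckets = defaultdict(list)
    for word, neigh in db_phrases:
        buckets[word].append(neigh)
    # Step 2: for each matched center word, weight the vote by the
    # confidence estimated from the neighbor clues.
    score = 0.0
    for word, q_neigh in query_phrases:
        for d_neigh in buckets.get(word, []):
            score += match_confidence(q_neigh, d_neigh)
    return score

if __name__ == "__main__":
    q = [(17, {3, 8, 21}), (42, {5, 9})]
    d = [(17, {3, 8, 30}), (99, {1, 2})]
    print(image_similarity(q, d))  # one matched center word, confidence 2/3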

To the best of our knowledge, there have been limited research efforts studying how to utilize binary descriptors like ORB in large-scale image search. OMVP utilizes ORB in large-scale image search without the need to generate a visual codebook. Our experiments show that, although OMVP only preserves the first 24 bits of the ORB descriptor for image retrieval, it still achieves competitive precision and significantly faster speed than existing BoWs-model-based retrieval methods. Thus, OMVP is more suitable for mobile applications that are sensitive to computational cost.

REFERENCES

[1] H. Bay, T. Tuytelaars, and L. V. Gool, “Speeded-up robust features (SURF),” Comput. Vis. Image Understand., vol. 110, no. 3, pp. 346–359, 2008.

[2] M. Brown and D. Lowe, “Unsupervised 3-D object recognition and reconstruction in unordered datasets,” in Proc. IEEE Int. Conf. 3-D Digital Imag. Model., Jun. 2005, pp. 56–63.

[3] M. Brown and D. G. Lowe, “Automatic panoramic image stitching using invariant features,” Int. J. Comput. Vis., vol. 74, no. 1, pp. 59–73, 2007.

[4] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” in Proc. Eur. Conf. Comput. Vis., 2010, pp. 778–792.

[5] V. Chandrasekhar, G. Takacs, D. Chen, S. Tsai, R. Grzeszczuk, and B. Girod, “CHoG: Compressed histogram of gradients: A low bit-rate feature descriptor,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 2504–2511.

[6] L. Chu, S. Jiang, S. Wang, Y. Zhang, and Q. Huang, “Robust spatial consistency graph model for partial duplicate image retrieval,” IEEE Trans. Multimedia, vol. 15, no. 8, pp. 1982–1996, Aug. 2013.

[7] M. Fischler and R. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM, vol. 24, no. 6, pp. 381–391, 1981.

[8] H. Jégou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in Proc. Eur. Conf. Comput. Vis., 2008, pp. 304–317.

[9] H. Jégou, M. Douze, and C. Schmid, “Improving bag-of-features for large scale image search,” Int. J. Comput. Vis., vol. 87, no. 3, pp. 316–336, 2010.

[10] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3304–3311.

[11] L. Juan and O. Gwun, “A comparison of SIFT, PCA-SIFT and SURF,” Int. J. Image Process., vol. 3, no. 4, pp. 143–152.

[12] Y. Ke and R. Sukthankar, “PCA-SIFT: A more distinctive representation for local image descriptors,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2004, vol. 2, pp. 506–513.

[13] Y. Ke, R. Sukthankar, and L. Huston, “Efficient near-duplicate detection and sub-image retrieval,” in ACM Multimedia, 2004, vol. 4, no. 1, pp. 869–876.

[14] S. Leutenegger, M. Chli, and R. Y. Siegwart, “BRISK: Binary robust invariant scalable keypoints,” in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 2548–2555.

[15] A. Levin, A. Zomet, S. Peleg, and Y. Weiss, “Seamless image stitching in the gradient domain,” in Proc. Eur. Conf. Comput. Vis., 2004, pp. 377–389.

[16] G. Li, M. Wang, Z. Lu, R. Hong, and T.-S. Chua, “In-video product annotation with web information mining,” ACM Trans. Multimedia Comput., Commun., Appl., vol. 8, no. 4, p. 55, Dec. 2012.

[17] D. Liu, G. Hua, P. Viola, and T. Chen, “Integrated feature selection and higher-order spatial feature extraction for object categorization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.

[18] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.

[19] E. Mair, G. D. Hager, D. Burschka, M. Suppa, and G. Hirzinger, “Adaptive and generic corner detection based on the accelerated segment test,” in Proc. Eur. Conf. Comput. Vis., 2010, pp. 183–196.

[20] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide baseline stereo from maximally stable extremal regions,” Image Vis. Comput., vol. 22, no. 10, pp. 761–767, 2004.

[21] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615–1630, Oct. 2005.

[22] D. Nistér and H. Stewénius, “Scalable recognition with a vocabulary tree,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2006, vol. 2, pp. 2161–2168.

[23] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.

[24] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.

[25] E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in Proc. Eur. Conf. Comput. Vis., 2006, pp. 430–443.

[26] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 2564–2571.

[27] S. Savarese, J. Winn, and A. Criminisi, “Discriminative object class models of appearance and shape by correlatons,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2006, vol. 2, pp. 2033–2040.

[28] X. Shen, Z. Lin, J. Brandt, S. Avidan, and Y. Wu, “Object retrieval and localization with spatially-constrained similarity measure and k-NN reranking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3013–3020.

[29] H. Y. Shum and R. Szeliski, “Systems and experiment paper: Construction of panoramic image mosaics with global and local alignment,” Int. J. Comput. Vis., vol. 36, no. 2, pp. 101–130, 2000.

[30] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2003, pp. 1470–1477.

[31] Q. Tian, S. Zhang, W. Zhou, R. Ji, B. Ni, and N. Sebe, “Building descriptive and discriminative visual codebook for large-scale image applications,” Int. J. Multimedia Tools Appl., vol. 51, no. 2, pp. 441–477, 2011.

[32] P. Viola and M. J. Jones, “Robust real-time face detection,” Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, 2004.

[33] B. Wang, Z. Li, M. Li, and W. Y. Ma, “Large-scale duplicate detection for web image search,” in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2006, pp. 353–356.

[34] M. Wang, G. Li, Z. Lu, Y. Gao, and T.-S. Chua, “When Amazon meets Google: Product visualization by exploring multiple web sources,” ACM Trans. Internet Technol., vol. 12, no. 4, p. 12, 2013.

[35] M. Wang, H. Li, D. Tao, K. Lu, and X. Wu, “Multimodal graph-based reranking for web image search,” IEEE Trans. Image Process., vol. 21, no. 11, pp. 4649–4661, Nov. 2012.

[36] M. Wang, K. Yang, X. Hua, and H. Zhang, “Towards a relevant and diverse search of social images,” IEEE Trans. Multimedia, vol. 12, no. 8, pp. 829–842, Dec. 2010.

[37] X. Wang, M. Yang, T. Cour, S. Zhu, K. Yu, and T. X. Han, “Contextual weighting for vocabulary tree based image retrieval,” in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 209–216.

[38] Z. Wu, Q. Ke, M. Isard, and J. Sun, “Bundling features for large scale partial-duplicate web image search,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 25–32.

[39] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1794–1801.

[40] S. Zhang, Q. Huang, G. Hua, S. Jiang, W. Gao, and Q. Tian, “Building contextual visual vocabulary for large-scale image applications,” in ACM Multimedia, 2010, pp. 501–510.

[41] S. Zhang, Q. Huang, Y. Lu, W. Gao, and Q. Tian, “Building pair-wise visual word tree for efficient image re-ranking,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Mar. 2010, pp. 794–797.

[42] S. Zhang, Q. Tian, G. Hua, Q. Huang, and S. Li, “Descriptive visual words and visual phrases for image applications,” in ACM Multimedia, 2009, pp. 75–84.

[43] S. Zhang, Q. Tian, K. Lu, Q. Huang, and W. Gao, “Edge-SIFT: Discriminative binary descriptor for scalable partial-duplicate mobile search,” IEEE Trans. Image Process., vol. 22, no. 7, pp. 2889–2902, Jul. 2013.

[44] S. Zhang, M. Yang, X. Wang, Y. Lin, and Q. Tian, “Semantic-aware co-indexing for image retrieval,” in Proc. IEEE Int. Conf. Comput. Vis., 2013.

[45] Y. Zhang, Z. Jia, and T. Chen, “Image retrieval with geometry-preserving visual phrases,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 809–816.

[46] Y. Zheng, M. Zhao, S.-Y. Neo, T.-S. Chua, and Q. Tian, “Visual synset: Towards a higher-level visual representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.

[47] W. Zhou, Y. Lu, H. Li, Y. Song, and Q. Tian, “Spatial coding for large scale partial-duplicate web image search,” in ACM Multimedia, 2010, pp. 511–520.

Shiliang Zhang received the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2012.

He is currently a Postdoctoral Research Fellow in the Department of Computer Science at the University of Texas at San Antonio, San Antonio, TX, USA. His research interests include large-scale image and video retrieval and multimedia content affective analysis. He has authored or co-authored over 20 papers in journals and conferences including IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE TRANSACTIONS ON MULTIMEDIA, Computer Vision and Image Understanding, ACM Multimedia, and ICCV.

Dr. Zhang was awarded the Outstanding Doctoral Dissertation Award by the Chinese Academy of Sciences, the President Scholarship by the Chinese Academy of Sciences, ACM Multimedia Student Travel Grants, and the Microsoft Research Asia Fellowship 2010. He is the recipient of the Top 10% Paper Award in IEEE MMSP 2011.

Qi Tian (M’96–SM’03) received the B.E. degree in electronic engineering from Tsinghua University, Beijing, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University, Philadelphia, PA, USA, in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 2002.

He is currently a Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA), San Antonio, TX, USA. He took a one-year faculty leave at Microsoft Research Asia (MSRA) during 2008–2009. His research interests include multimedia information retrieval and computer vision. He has published over 210 refereed journal and conference papers. His research projects were funded by NSF, ARO, DHS, SALSI, CIAS, and UTSA. He has served as a Guest Editor for the Journal of Computer Vision and Image Understanding, Pattern Recognition Letters, the EURASIP Journal on Advances in Signal Processing, and the Journal of Visual Communication and Image Representation. He is on the Editorial Board of the Multimedia Systems Journal, the Journal of Multimedia, and the Journal of Machine Vision and Applications.

Dr. Tian has received faculty research awards from Google, NEC Laboratories of America, FXPAL, Akiira Media Systems, and HP Labs. He received the Best Paper Awards in PCM 2013, MMM 2013, and ICIMCS 2012, the Top 10% Paper Award in MMSP 2011, the Best Student Paper in ICASSP 2006, and the Best Paper Candidate in PCM 2007. He received the 2010 ACM Service Award. He is a Guest Editor of the IEEE TRANSACTIONS ON MULTIMEDIA. He is on the Editorial Board of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT).

Qingming Huang (M’04–SM’08) received the Ph.D. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 1994.

He was a Postdoctoral Fellow with the National University of Singapore from 1995 to 1996, and worked at the Institute for Infocomm Research, Singapore, as a Member of Research Staff from 1996 to 2002. He joined the Chinese Academy of Sciences under the Science100 Talent Plan in 2003, and is currently a Professor with the Graduate University of the Chinese Academy of Sciences. His current research areas are image and video analysis, video coding, pattern recognition, and computer vision.

Yong Rui (F’10) received the B.S. degree from Southeast University, Nanjing, China, the M.S. degree from Tsinghua University, Beijing, China, and the Ph.D. degree from the University of Illinois at Urbana-Champaign, Champaign, IL, USA.

He is currently a Senior Director with Microsoft Research Asia, Beijing, China. He is an Associate Editor of the ACM Transactions on Multimedia Computing, Communications, and Applications and a Founding Editor of the International Journal of Multimedia Information Retrieval. He was an Associate Editor of the ACM/Springer Multimedia Systems Journal from 2004 to 2006, and the International Journal of Multimedia Tools and Applications from 2004 to 2006.

Dr. Rui is a Fellow of the IAPR and the International Society of Optics and Photonics, and a Distinguished Scientist of the ACM. He is an Associate Editor-in-Chief of the IEEE MULTIMEDIA MAGAZINE. He was an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA from 2004 to 2008, and of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY from 2006 to 2010.