Linear Distance Coding for Image Classification

Zilei Wang, Jiashi Feng, Shuicheng Yan, Senior Member, IEEE, and Hongsheng Xi

Abstract— The feature coding-pooling framework is shown to perform well in image classification tasks, because it can generate discriminative and robust image representations. The unavoidable information loss incurred by feature quantization in the coding process and the undesired dependence of pooling on the image spatial layout, however, may severely limit the classification. In this paper, we propose a linear distance coding (LDC) method to capture the discriminative information lost in traditional coding methods while simultaneously alleviating the dependence of pooling on the image spatial layout. The core of the LDC lies in transforming local features of an image into more discriminative distance vectors, where the robust image-to-class distance is employed. These distance vectors are further encoded into sparse codes to capture the salient features of the image. The LDC is theoretically and experimentally shown to be complementary to the traditional coding methods, and thus their combination can achieve higher classification accuracy. We demonstrate the effectiveness of LDC on six data sets, two of each of three types (specific object, scene, and general object), i.e., Flower 102 and PFID 61, Scene 15 and Indoor 67, Caltech 101 and Caltech 256. The results show that our method generally outperforms the traditional coding methods, and achieves or is comparable to the state-of-the-art performance on these data sets.

Index Terms— Image classification, image-to-class distance, linear distance coding (LDC).

I. INTRODUCTION

GENERATING compact, discriminative, and robust image representations is undoubtedly critical to image classification [1], [2]. Recently, several local features, e.g., SIFT [3] and HOG [4], have become quite popular for representing images due to their ability to capture distinctive details of the images. However, local features are rarely fed directly into image classifiers, owing to the computational complexity and their sensitivity to noise. A common strategy is to first integrate the local features into a global image representation. To this end, various methods [1], [2], [5], [6] have been proposed, among which the Bag of Words (BoW) based ones [1], [2], [5] present outstanding simplicity and effectiveness.

Manuscript received February 16, 2012; revised August 30, 2012; accepted August 30, 2012. Date of publication September 13, 2012; date of current version January 10, 2013. This work was supported in part by the National Natural Science Foundation of China under Grant 61203256 and the Singapore Ministry of Education under Grant MOE2010-T2-1-087. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Erhardt Barth.

Z. Wang is with the Department of Automation, University of Science and Technology of China (USTC), Hefei 230027, China, and also with the Department of Electrical and Computer Engineering, National University of Singapore, 117576 Singapore (e-mail: [email protected]).

J. Feng and S. Yan are with the Department of Electrical and Computer Engineering, National University of Singapore, 117576 Singapore (e-mail: [email protected]; [email protected]).

H. Xi is with the School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2012.2218826

The BoW image representation is typically generated via the following three steps: 1) extract local features of an image at interest points; 2) generate a dictionary/codebook and then quantize/encode the local features into codes accordingly; and 3) pool all the codes together to generate the global image representation. Such a process can be summarized as a feature extraction-coding-pooling pipeline. It has been widely used in recent image classification methods and achieves impressive performance [1], [2], [7].
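As a schematic illustration (not the paper's implementation), the pipeline can be summarized in a few lines of Python; the encode and pool arguments are hypothetical placeholders for any coding model and pooling operator:

import numpy as np

def bow_representation(local_features, dictionary, encode, pool):
    # Step 2: encode each extracted local feature against the codebook
    codes = np.stack([encode(x, dictionary) for x in local_features])
    # Step 3: pool all codes into one global image representation
    return pool(codes)

# e.g., pool=lambda C: C.max(axis=0) for max-pooling,
#       pool=lambda C: C.mean(axis=0) for average pooling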

Within the above framework, the coding process inevitably introduces information loss due to the feature quantization. Such undesirable information loss severely damages the discriminative power of the generated image representation and thus decreases the image classification performance. Therefore, various coding methods have been proposed to encode local features more accurately with less information loss. Most of these methods are developed from Vector Quantization (VQ), which conducts hard assignment in the coding process [5]. In spite of its great simplicity, its inherently large coding error¹ often leads to unrecoverable loss of discriminative information and severely limits the classification performance [8]. To alleviate this issue, various coding methods have been proposed. For example, soft-assignment [6], [9], [10] estimates memberships of each local feature to multiple visual words instead of a single one. Another modified method is Super Vector (SV) coding [11], which additionally incorporates the difference between the local feature and the selected visual word. Thus SV captures higher-order information and shows improved performance.

Though many coding methods [1], [2], [10], [11] have been proposed to represent the input features accurately, the information loss in the feature quantization for coding is still inevitable. In fact, Boiman et al. [8] have pointed out that local features from long-tail distributions are inherently inappropriate for quantization, and that the information lost in feature quantization is quite important for good image classification performance. To tackle this issue, the Naive Bayes Nearest Neighbor (NBNN) method was proposed to avoid the feature coding process by employing the image-to-class distance for image classification [8]. Benefiting from the alleviated information loss, NBNN achieves classification performance competitive with coding based methods on multiple datasets. Motivated by its success, several methods [12]–[14] have been developed to further improve NBNN. However, all variants of NBNN practically employ uniform summation to aggregate image-to-class distances calculated from local features. This introduces two inherent drawbacks: they are sensitive to noisy features and are easily dominated by outlier features.

¹Also called the coding residual: the difference between the original local feature and the feature reconstructed from the produced codes.

In essence, the BoW-based methods and the NBNN-based methods use different visual characteristic statistics to perform image classification. The former depends on salient features of an image, while the latter treats all local features equally. In addition, the NBNN methods replace image-level similarities with the image-to-class distance when performing classification, in order to generate more robust results. Therefore, the BoW and NBNN based methods may be suitable for different types of images. For example, for images with cluttered background, the BoW based ones show better classification performance due to their ability to capture salient features. It is therefore reasonable to expect that if we can combine the advantages of both, namely capturing the saliency of images without information loss, the classification performance can be further improved.

Besides reducing the information loss of feature coding, how to explore spatial context more effectively is also crucial for achieving good classification performance. In most coding-pooling based methods, Spatial Pyramid Matching (SPM) [7] has been widely adopted in the pooling procedure due to its effectiveness and simplicity. However, SPM strictly requires the involved images to present similar spatial layouts to ensure that the generated image representations match well in an element-wise manner [15]. This requirement originates from the fact that the local features used often represent object-specific visual patterns. However, such a requirement has a negative effect on classification accuracy because realistic images usually show various spatial layouts, even within the same category. Alternatively, if the elements of the adopted features can be transformed to bear class-specific semantics, this requirement would be greatly relieved.

In this paper, we propose a novel Linear Distance Coding (LDC) method to simultaneously inherit the nice properties of BoW and NBNN while relieving the image spatial alignment requirement of SPM. LDC also works under the feature extraction-coding-pooling framework, i.e., it generates the image representations from the salient characteristic local features for the classification, as shown in Figure 1. The proposed LDC particularly focuses on utilizing the discriminative information lost by the traditional coding methods and on exploiting the spatial information more effectively. In practice, LDC transforms each local feature into a distance vector, which is an alternative discriminative pattern of the local feature, in the class-manifold coordinate system. Compared with the original local features, each element of the distance vector represents a certain class-specific semantic, namely the distance of the local feature to a class-specific manifold. Thus the strict requirement of image layout similarity in the original SPM can be effectively relieved, since the embedded class semantic in each feature element robustifies the similarity calculation between objects posing differently, as detailed later. Comprehensive experiments on various types of datasets consistently show that the image representation produced by LDC achieves better or competitive performance compared with the state-of-the-art.

Fig. 1. Illustration of linear distance coding. The local features extracted from various classes of training images are first used to generate a manifold for each class, represented by a set of local features (i.e., anchor points). Based on the obtained class manifolds, the local feature x_i is transformed into a more discriminative distance vector d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,K}]^T, where K denotes the number of classes. On these transformed distance vectors, linear coding and max-pooling are performed to produce the final image representation. The principle of the distance transformation from the original local feature x_i to the distance feature d_i is to form a class-manifold coordinate system from the K obtained class manifolds, where each class corresponds to one axis. For the kth class manifold M_k, the coordinate value d_{i,k} of the local feature x_i corresponds to the distance between x_i and this class manifold. Image best viewed in color.

Furthermore, the image representations produced by LDC are proven to be complementary to those from the original coding methods. Thus their combination, even a direct concatenation of the resulting image representations, can yield remarkable performance improvement, as expected.

The main contributions of this work can be summarized as follows:

1) We propose a novel distance pattern of local features through constructing the class-manifold coordinate system. The produced distance vectors are quite discriminative and are able to relieve the strict requirement of SPM on image spatial layout, benefiting from the adopted more robust image-to-class distance.

2) We propose a linear distance coding (LDC) method, which conducts linear coding and max-pooling on the transformed distance vectors to elegantly aggregate the salient features of images. Compared with the NBNN methods, such a process avoids the undesired case where the discriminative features are dominated by outlier or noisy features, especially for images with cluttered background.

3) From both theoretical analysis and experimental verification, the image representations produced by LDC are complementary to those from the traditional coding methods. Their combination is shown to outperform each individual of them and to achieve state-of-the-art performance on various benchmark datasets.

This paper is organized as follows. Section II introduces the related works, including the linear coding models and the NBNN methods. Section III proposes the distance pattern by introducing the class-manifold coordinate system. Section IV applies linear coding and max-pooling to the transformed distance vectors, and discusses the combination of LDC and the original coding method. Experiments on three types of datasets are presented in Section V, along with a discussion of the sensitivity of classification performance to the key parameters. Finally, Section VI concludes this work.

II. RELATED WORKS

The proposed Linear Distance Coding (LDC) simultaneously utilizes the linear coding methods and the image-to-class distance adopted in NBNN [8]. In this section, we briefly discuss the conventional coding methods and the NBNN methods.

1) Linear Coding Models: Linear coding approximates an input feature by a linear combination of the bases in a given dictionary. Through the coding process, input features are transformed into more discriminative codes. Popular linear coding models include Vector Quantization (VQ) [5], Soft-assignment Coding [6], Sparse Coding (SC) [1], Locality-constrained Linear Coding (LLC) [2], and their variants [16].

Given a dictionary B = [b_1, b_2, \ldots, b_p] ∈ R^{d×p} consisting of p basis features with dimensionality d, linear coding computes a reconstruction coefficient vector v ∈ R^p to represent the input feature x ∈ R^d by minimizing the following loss function:

L(v) = \frac{1}{2}\|x - Bv\|_{\ell_2}^2 + \lambda R(v)    (1)

where the first term measures the approximation error and the second serves as regularization. In fact, existing coding models mainly differ from each other in imposing different prior structures on the generated code v via a specific regularization R(·).

In particular, LLC [2] considers locality to be more essential than sparsity for feature coding. It adopts a locality adaptor in the regularization R(·) to replace the ℓ₁-norm used in SC. The locality regularization takes into account the underlying manifold structure of local features and thus ensures good approximation. Inspired by LLC, Liu et al. [10] propose to inject locality into soft-assignment coding and devise the Localized Soft-Assignment (LSA) coding method. For any local feature x, its membership estimation is restricted to only a certain number of nearest bases in the dictionary. LSA discards the possibly unreliable interpretations from distant bases and obtains more accurate posterior probability estimation. However, the accuracy of such posterior estimation (i.e., the coding result) heavily depends on the size of the adopted dictionary and the underlying distribution of local features, which determine the performance of image classification.
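For concreteness, the fast approximate LLC solution described in [2] reduces to a small constrained least-squares problem over the k nearest bases. The following NumPy sketch follows that recipe; the function name and regularization constant are our own choices:

import numpy as np

def llc_code(x, B, k=5, reg=1e-4):
    """Approximate LLC coding of feature x (d,) against dictionary B (p, d)."""
    d2 = np.sum((B - x) ** 2, axis=1)        # squared distances to all bases
    idx = np.argsort(d2)[:k]                 # indices of the k nearest bases
    Bk = B[idx] - x                          # shift the neighborhood to the origin
    C = Bk @ Bk.T                            # k x k local covariance
    C += reg * np.trace(C) * np.eye(k)       # regularize for numerical stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                             # enforce the constraint 1^T v = 1
    v = np.zeros(B.shape[0])
    v[idx] = w                               # non-neighbors get zero weight
    return v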

Inspecting the feature coding in (1), the information loss may originate from two aspects. The first is the inaccurate linear approximation and the imperfectness of the dictionary B. The second is that the structure enforced by R(·) can only be achieved by sacrificing some approximation accuracy. In linear coding models, which operate on the original local features, such information loss is inevitable. However, the lost information is probably quite important for accurate image classification [8].

2) NBNN Methods: The Naive Bayes Nearest Neighbor (NBNN) [8] is essentially a non-parametric classification method without a training phase, where classification is performed based on the summation of Euclidean distances between the local features of the test image and the reference classes (i.e., the image-to-class distance) [8], [12]–[14]. By avoiding feature coding, NBNN effectively reduces the information loss and thus achieves competitive classification performance on multiple benchmark datasets.

In the NBNN methods, all local features from the same class are assumed to be i.i.d. samples from a certain class-specific distribution, and thus image classification is equivalent to a maximum likelihood estimation problem [8]:

\hat{c} = \arg\max_c p(c|Q) = \arg\max_c \prod_{x \in Q} p(x|c)    (2)

where c denotes the class and Q denotes all the descriptors of the query image. In particular, NBNN estimates the likelihood through a set of Parzen kernel functions (typically the Gaussian kernel):

p(x|c) = \frac{1}{L} \sum_{j=1}^{r} \exp\left( -\frac{1}{2\sigma^2} \|x - x_j^c\|^2 \right)    (3)

where x_j^c is the j-th nearest neighbor in class c, σ is the bandwidth of the kernel function, L is a normalization factor, and r denotes the number of nearest neighbors. In NBNN, the case r = 1 is particularly used due to its simplicity and interpretability. In this case, the NBNN criterion simplifies to:

\hat{c} = \arg\min_c \sum_{i=1}^{N} \|x_i - x_i^c\|_{\ell_2}^2    (4)

where x_i^c is the nearest neighbor of x_i in class c, and N is the number of local features. The original NBNN method [8] treats local features and classes equally and independently via the summation in (4), which causes sensitivity to noisy features and outliers. Consequently, the classification performance cannot be greatly improved even though the more robust image-to-class distance is adopted.

More specifically, the original NBNN algorithm suffers from the following three drawbacks: 1) the spatial information [7] is not fully exploited, although it is known to be quite useful for image classification; 2) the computational complexity rapidly increases with the number of local features, severely limiting scalability. In particular, the time complexity for one query image with N features is O(N N_D log N_D), where N_D is the number of all local features of the training images [8]; and 3) it treats all classes equally for any local feature of the testing image, and consequently cannot adapt to the involved dataset or capture the image saliency well, as discussed above.

To alleviate these issues, various modified methods have been proposed, such as using a class-specific Mahalanobis metric instead of the Euclidean distance [13], associating class-specific parameters with each class [12], and kernelizing the NBNN [14]. These modified NBNN methods [12]–[14] share two features, although they seem quite different. First, all of them use the same strategy to improve classification performance, namely enhancing the adaptiveness of the resulting metrics by learning some key parameters. In fact, such a learning process is an alternative to training parametric models on the training samples. Second, the final classification criterion always reduces to the summation of a certain distance over all local features within each image, no matter what distance metric is adopted. Such a uniform summing operation usually renders the generated metric sensitive to noisy points, as aforementioned. Consequently, NBNN alone cannot outperform the feature coding based methods in image classification tasks.

III. DISTANCE PATTERN

In this work, we focus on solving the image classification problem, formally stated as follows: given a set of local features X_i and the class label y_i of the i-th image I_i, we want to learn a classifier from local features to image labels, C : X_i ↦ y_i, such that the classification error is minimized w.r.t. both the training and test images. In particular, we aim at a method that generates more discriminative image representations from X_i for better classification performance. We propose a novel coding method which preserves the superior discriminative capability and robustness of the feature coding based methods [2], and meanwhile effectively captures the information lost by previous coding methods. In the following, we first introduce the proposed distance pattern, which is more discriminative and robust.

A. Class-Specific Distance

Using the distance between a local feature and a certain class to estimate image membership can provide better generalization capability. Such class-specific distance is fundamental to the NBNN methods and crucial for achieving outstanding classification performance [8]. In particular, all existing NBNN methods approximate the class-specific distance by calculating the distance between the local feature and its nearest neighbor retrieved from the reference images [8]. Formally, let d(x_i, c) denote the distance between a local feature x_i and the class c. Here the class c consists of a set of local features {x_j^c}, all of which are extracted from the training images of c. Then d(x_i, c) is computed as

d(x_i, c) = \min_{x \in \{x_j^c\}} \|x_i - x\|_{\ell_2}^2 = \|x_i - x_i^c\|_{\ell_2}^2    (5)

where x_i^c denotes the mapped point of x_i in class c, which reduces to the nearest neighbor of x_i in the NBNN methods. However, the distance derived in Equation (5) suffers from the following drawbacks:

1) It is quite sensitive to noisy features in the training set {x_j^c}. A local feature is prone to change significantly even under slight appearance variation, which causes ubiquitous noisy features. In the presence of noisy features or outliers in {x_j^c}, the estimated distance of local features in the testing image may severely deviate from the correct one because of the fragile quadratic criterion. This may lead to quite unreliable distance patterns and consequently degrade the performance of any classification criterion based on them.

2) It is highly computationally expensive to find the nearest neighbor for each query feature, as aforementioned. The computational complexity O(N N_D log N_D) increases proportionally with the number of local features in the training set. In practice, many works extract a huge number of local features, which heavily limits the efficiency of NBNN based methods. Although there are some accelerated algorithms [17], [18], the low efficiency is still a bottleneck of such distance calculation.

To alleviate these issues, we propose a novel algorithm to calculate the distance d(x_i, c). The essential idea is to calculate a more appropriate mapping point x_i^c rather than simply finding the nearest neighbor as in NBNN. The new x_i^c is allowed to be a virtual local feature in class c. In particular, we assume the local features of each class are sampled from a class-specific manifold M_c, which is completely determined by the available local features of the corresponding class, {m_i^c}_{i=1}^{n_c}. Such features are called "anchor points" [19] and can be obtained by clustering the local features of class c. Here the manifold of class c is denoted as M_c = [m_1^c, m_2^c, \ldots, m_{n_c}^c]. Then the computational complexity for a single input image with N features becomes O(N n_c log n_c) with n_c ≪ N_D, where N_D is the number of all training local features. For example, in our following experiments there are about 60,000 local features per class, with 2,000 features per image and 30 training images. After the clustering preprocessing, only n_c = 1024 ≪ 60,000 anchor points are used to describe the manifold. In addition to reducing the complexity, using the cluster centers as anchor points effectively reduces the influence of noisy features and thus produces a more robust description of the manifold. This holds under the reasonable assumption that the fraction of outliers is small, so that the resulting centers are mainly determined by the dominant inlier features.

Now we present an efficient algorithm to determine a good mapping point x_i^c, even when relatively few anchor points are provided. By utilizing the locally linear structure of the manifold, x_i^c can be calculated through locally linear regression. More specifically, x_i^c is computed as a linear combination of its neighboring anchors on the manifold M_c. Here we apply an approximate fast solution of LLC [2] to our problem, which selects only a fixed number of nearest neighbors and can be formulated as follows:

\min_{v_i} \|x_i - M_c v_i\|_{\ell_2}^2
subject to: v_{i,j} = 0 if m_j^c \notin N_i^k;  1^T v_i = 1, ∀i    (6)

where v_i = [v_{i,1}, v_{i,2}, \ldots, v_{i,n_c}]^T is the vector of linear representation coefficients of x_i on the manifold M_c, and N_i^k is the set of k nearest neighbors of x_i. Substituting the resulting x_i^c = M_c v_i derived from (6) into (5), the distance d(x_i, c) is finally obtained, which is denoted as d_i^c. Such class-specific distance is motivated by capturing the underlying manifold structure of the local features and is computed in a robust linear regression manner. Thus it gains stronger discriminative power and more robustness to noisy and outlier features.

B. What Is a Good Distance Pattern?

Let d_i = [d_i^1, d_i^2, \ldots, d_i^K]^T ∈ R^K denote the distance vector of the local feature x_i, which aggregates its distance relationships to all K classes. In contrast to the original local features (e.g., SIFT), which describe the appearance patterns of characteristic objects, the distance vector represents a relative pattern that captures the discriminative part of a local feature w.r.t. the specified classes, i.e., it is more class-specific, as desired. In fact, the distance vector is the projection residue of the local feature onto the class manifolds, as shown in Figure 1. Note that in the figure each axis denotes one class manifold. Through such a residue-pursuit feature transformation, the distance vector gains the following advantages over the original local features:

1) The distance vector preserves the discriminative information of local features that is lost in the traditional feature coding process.

2) The distance vector coordinates better with additional operations that explore useful spatial information, e.g., SPM. The spatial pooling of traditional local features requires the involved images to have similar object layouts such that the resulting representations of different images can be well matched element-wise. Such an overly strict requirement is significantly relieved by the distance vector because of the class-specific characteristic of the adopted image-to-class distance, as shown in Figure 2.

Compared with previous NBNN methods, which directly sum up the image-to-class distances for classification, we propose to use the distance vector as a new kind of local feature. Thus, any classification model used on the original local features fits the distance vector perfectly.

Before providing the more robust and discriminative distance pattern, we first recall the original NBNN strategy for image classification. Given an image I with N local features x_i, the distance vectors d_i ∈ R^K are calculated as in (5). Then the estimated category ĉ of I is determined by the following criterion:

\hat{c} = \arg\min_k \left[ \sum_{i=1}^{N} d_i \right]_k = \arg\min_k \left( \sum_{i=1}^{N} d_{i,1}, \sum_{i=1}^{N} d_{i,2}, \ldots, \sum_{i=1}^{N} d_{i,K} \right)    (7)

where k is the index of the element corresponding to the category. Namely, the original NBNN method considers only the element-wise semantics of the obtained distance vectors separately, and completely ignores the intrinsic pattern described by the distance vector.
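In code, criterion (7) is a one-line reduction over the stacked distance vectors of the query image (a hypothetical sketch):

import numpy as np

def nbnn_over_distance_vectors(D):
    """D: (N, K) matrix stacking the distance vectors d_i of one query image."""
    return int(np.argmin(D.sum(axis=0)))     # class with the smallest summed distance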

Fig. 2. Schematic diagram of the distance pattern relieving the requirement of layout similarity. In the original feature space, each class has multiple clusters of characteristic features. When the involved images have different layouts, the resulting image representations may be quite different, because the features falling in the same SPM grid of different images differ. This negatively affects the usual element-wise matching based methods in achieving high classification accuracy. Such an undesired situation is significantly resolved by our proposed distance transformation, as all distance vectors within the same class become more similar in the distance feature space, benefiting from the class-specific characteristic of the adopted image-to-class distance. Consequently, image representations of the same class become closer to each other in the image-level representation space, even when they show totally different layouts (e.g., the distance image representations v_{I_1}^d and v_{I_2}^d in class 1). Different shapes represent different classes in certain feature spaces, and different colors indicate different features (e.g., the pink rectangles represent the indistinctive features in class 1, lying close to class 2). Image best viewed in color.

Different from the previous methods, we regard each distance vector as an integral feature, and then apply the outperforming coding model on these transformed features. In particular, the distance pattern finally used in our method admits the following form:

d'_i = d_i - \min(d_i), \quad \bar{d}_i = f_n(d'_i) = \frac{1}{\|d'_i\|_{\ell_2}} [d'_{i,1}, d'_{i,2}, \ldots, d'_{i,K}]^T    (8)

where f_n(·) is the normalization function with the ℓ₂-norm. From Equation (8), the adopted \bar{d}_i mainly represents the distance pattern, with \|\bar{d}_i\|_{\ell_2} = 1. In practice, compared with the direct normalization f_n(d_i) without the minimum subtraction, the normalization in (8) is experimentally shown to produce slightly higher classification accuracy [14], which may benefit from the increased gap between elements that describes features more discriminatively. For simplicity, we use d_i to refer to \bar{d}_i in the following sections when there is no ambiguity. Finally, we summarize the procedure to compute the adopted distance pattern in Algorithm 1.

IV. LINEAR DISTANCE CODING

Here we explore how to utilize the obtained distance vectors to produce a discriminative and robust image representation. Different from the previous NBNN-like methods, we aggregate the obtained distance pattern under the coding-pooling framework, which provided state-of-the-art performance in previous works. An overview of the image classification flowchart is shown in Figure 3. The distance vectors are transformed from the local features one by one; then the distance vectors and the original local features are separately encoded and pooled to generate two image representations, v_I^d and v_I.

Algorithm 1: Distance Pattern

Data: N local features {x_i}_{i=1}^N of image I; the class-specific manifolds M_c, c = 1, 2, ..., K.
Result: The desired distance vectors d_i, i = 1, 2, ..., N.
for i ← 1 to N do
    for k ← 1 to K do
        calculate v_i using (6), then d_{i,k} = ‖x_i − M_k v_i‖²_{ℓ₂}
    end
    Construct the distance vector d_i = [d_{i,1}, d_{i,2}, ..., d_{i,K}]^T.
    Obtain the normalized distance vector \bar{d}_i from (8).
end

Fig. 3. Overview of the image classification flowchart. This architecture has been proven to achieve state-of-the-art performance on the basis of a single type of feature, e.g., LLC [2]. (a) Linear coding and max-pooling are sequentially performed on the originally extracted local features, resulting in the original image representation. (b) All local features are transformed into distance vectors, on which linear coding and max-pooling are sequentially performed. This coding process is called LDC in this paper, and it results in the distance image representation. Finally, the original image representation and the distance image representation are simply concatenated so that they complement each other, and a linear SVM is adopted for the final classification.

Finally, a linear SVM is adopted to classify the images based on an individual image representation or on their concatenated image representation.

To verify the effectiveness and generality of this distance transformation, we apply two different coding models independently, i.e., LLC [2] and Localized Soft-Assignment coding (LSA) [10], to encode the distance vectors, owing to the high efficiency of their approximate fast solutions. We illustrate this procedure via LLC². Let B ∈ R^{K×P} be the distance dictionary consisting of P distance vectors b_1, b_2, \ldots, b_P, which can be obtained by k-means clustering on the distance vectors of the training images. For an input distance vector d_i, the corresponding code y_i is calculated as follows [2]:

\min_{y_i} \|d_i - B y_i\|_{\ell_2}^2 + \lambda \|e_i \odot y_i\|_{\ell_2}^2, \quad subject to: 1^T y_i = 1, ∀i    (9)

where ⊙ denotes element-wise multiplication, 1 is a P-dimensional all-one vector, and e_i ∈ R^P is the locality adaptor that gives each visual word freedom proportional to its similarity to the input distance feature d_i.

After linear coding on the distance vectors, max-pooling is performed on the obtained sparse codes {y_i} to produce the distance image representation v_I^d for image I, namely

v_I^d = \max(y_1, y_2, \ldots, y_N)    (10)

²For the LSA counterpart, refer to [10] for details.
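Putting (9) and (10) together, the LDC representation of one image is obtained by coding each distance vector and taking an element-wise maximum; a sketch reusing the earlier llc_code function (spatial pyramid pooling omitted for brevity):

import numpy as np

def ldc_representation(D, B_dist, k=5):
    """D: (N, K) distance vectors of one image.
    B_dist: (P, K) distance dictionary learned by k-means (rows as atoms)."""
    codes = np.stack([llc_code(d, B_dist, k) for d in D])  # eq. (9)
    return codes.max(axis=0)                               # eq. (10): max-pooling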

Fig. 4. Illustration of the complementarity between the image representations produced by LLC-like coding methods and our LDC method. In the coding-pooling framework, the original local feature x is approximated by the fixed visual words (anchor points) and the corresponding code v. Here we suppose the anchor points of all classes form a fixed global dictionary B = [M_1, M_2, \ldots, M_K] by concatenation. Then the information of the original feature x can be completely expressed by the generated codes v = [v_1, v_2, \ldots, v_K]^T together with the residue errors [n_1, n_2, \ldots, n_K]^T. In fact, the proposed LDC utilizes the residue error information by compressing n_k into d_k with d_k = ‖n_k‖²_{ℓ₂}. Therefore, the image representations v_I and v_I^d are complementary to each other due to their complementary perspectives on utilizing the original information.

where max is performed element-wise over the involved vectors. In addition, SPM with three levels is adopted for the spatial pooling. Thus, the distance image representation v_I^d is equally compact, salient, and discriminative as the original image representation v_I.

Here we provide a brief analysis of the relationship between the original image representation v_I and the distance image representation v_I^d. The most intuitive difference is that they are derived from two different local features: the original local features {x_i} and the distance vectors {d_i}, respectively. For an individual point within an image, the coding quantization on the original local feature inevitably loses some important information, since it preserves only the principal information, while the distance vector captures the discriminative information in the residue part and thus compensates for the information loss, as shown in Figure 4. So it is credible that the resulting image representations v_I and v_I^d are complementary to each other. In practice, we simply concatenate v_I and v_I^d to form a longer vector v_I^c, which is expected to achieve better performance. The benefit of such complementarity is well verified by the following experiments on multiple types of benchmark datasets.
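The combination step is then a simple concatenation followed by a linear SVM; a sketch assuming scikit-learn's LinearSVC (variable names are hypothetical):

import numpy as np
from sklearn.svm import LinearSVC

def train_combined(V_sift, V_dist, labels):
    """V_sift, V_dist: (n_images, D) pooled representations v_I and v_I^d."""
    X = np.hstack([V_sift, V_dist])          # concatenated representation v_I^c
    return LinearSVC(C=1.0).fit(X, labels)   # linear SVM with C = 1 (Section V-A)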

V. EXPERIMENTS

In this section, we evaluate the performance of the proposed method on three groups of benchmark datasets: specific objects (e.g., flowers, food), scenes, and general objects. In particular, the specific object datasets are Flower 102 [20] and PFID 61 [21], in which the images are relatively clean without cluttered background. The scene datasets are Scene 15 [7] and Indoor 67 [22]. And the general object datasets are Caltech 101 [23] and Caltech 256 [24].

Among the various feature coding models producing relatively compact image representations, Locality-constrained Linear Coding (LLC) and Localized Soft-Assignment Coding (LSA) almost always achieve state-of-the-art classification performance [2], [10]. In addition, compared with ScSPM and other similar methods, they have much lower computational complexity owing to existing fast solutions [2]. Thus we adopt LLC and LSA individually as the coding model in our method, where max-pooling is always employed. Of course, similar coding models can also be naturally applied to the transformed distance features, e.g., the Laplacian Sparse Coding (LScSPM) proposed by Gao et al. [16]. The main target of the following experiments is to verify the uniform effectiveness of the proposed distance pattern in improving classification performance. Moreover, we adopt the best performance of the comparable methods ever reported on each dataset, together with the achieved accuracies of LLC and LSA, as the baselines in the performance evaluation. Before reporting the detailed classification results on these datasets, we first give the experimental settings.

A. Experimental Settings

For fair comparison with previously reported results, local features of a single type, dense SIFT [3], are used throughout the experiments. In all our experiments, SIFT features are extracted at a single scale from densely located patches of gray images. The patches are centered at every 4 pixels and have a fixed size of 16 × 16 pixels, where the VLFeat library [25] is used. Before feature extraction, all images are resized, with preserved aspect ratio, to no more than 300 × 300 pixels. The anchor points {m_i^c} of each class manifold M_c are learned from the training images of that class, and their number is fixed at n_c = 1024 for all classes throughout our experiments. For the original dense SIFT features and the corresponding distance vectors, the global dictionaries containing P visual words are learned individually from all training samples via k-means clustering. In particular, P = 2048 is fixed for all datasets. Each SIFT feature x_i or distance vector d_i is normalized by its ℓ₂-norm and then encoded into a P-dimensional vector.

An important parameter of LLC and LSA is the number of nearest neighbors k_nn^c used in encoding local features. In our method, the distance vector is similarly calculated based on k_nn^d neighbors in the specified class manifold. To reduce their influence on classification performance, four different values are used for each of these parameters, i.e., k_nn^d ∈ {1, 2, 3, 4} and k_nn^c ∈ {2, 5, 10, 20}, as suggested in LLC [2]. In the experiments, we report the best result for each method under these parameters, and the influence of these parameters is discussed in the following subsection. In addition, the bandwidth parameter β of LSA is fixed at 10, following the authors' setting in [10].

Fig. 5. Example images of the Flower 102 dataset, where each row represents one category. (a) Original images. (b) Corresponding segmented images. Limited by the performance of the segmentation algorithm, the segmented images may contain part of the background, lose part of the object, or even lose the whole object. Image best viewed in color.

In the experiments, SPM is used by hierarchically partitioning each image into 1×1, 2×2, and 4×4 blocks on three levels, whose cumulative concatenations are denoted by SPM0, SPM1, and SPM2, respectively. In particular, SPM2 means that all three levels (from 0 to 2) are used by concatenating their pooling vectors. All obtained image-level representations are fed into a linear SVM in the training and testing phases (the libLinear package [26]), where the penalty parameter of the SVM is fixed at C = 1. Actually, we found the classification performance to be quite stable across different penalty parameter values. The number of repetitions and the numbers of training and testing samples follow the configuration provided with each dataset. Performance is measured by the average classification accuracy over all classes. For multiple runs, both the mean and the standard deviation of the classification accuracy are reported.
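As a concrete reading of the pooling configuration above, here is a minimal sketch of three-level SPM max-pooling; the grid-assignment details are our assumptions:

import numpy as np

def spm_max_pool(codes, xy, width, height, grids=(1, 2, 4)):
    """codes: (N, P) codes of one image; xy: (N, 2) patch-center coordinates."""
    pooled = []
    for g in grids:                          # levels 0, 1, 2 -> 1x1, 2x2, 4x4 grids
        gx = np.minimum((xy[:, 0] * g // width).astype(int), g - 1)
        gy = np.minimum((xy[:, 1] * g // height).astype(int), g - 1)
        for a in range(g):
            for b in range(g):
                m = (gx == a) & (gy == b)
                pooled.append(codes[m].max(axis=0) if m.any()
                              else np.zeros(codes.shape[1]))
    return np.concatenate(pooled)            # (1 + 4 + 16) * P dimensions (SPM2)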

As for the evaluation of the proposed method, we report the results of three different image-level representations: the original feature representation v_I, the distance image representation v_I^d, and their direct concatenation v_I^c. In the experimental results, LLC and LSA are assembled separately with different input features. For example, LLC-SIFT refers to applying LLC on the original SIFT features to produce the image-level representation, and LLC-Combine refers to the result of concatenating the image representations from LLC-SIFT and LLC-Distance.

B. Specific Object Datasets

We first evaluate the proposed method on the Flower 102 [20] and PFID 61 [21] datasets, whose images are relatively clean and whose background is less cluttered.

1) Flower 102: Flower 102 is a 102-category flower dataset [20] containing 8,189 images, with each class consisting of 40 to 258 images. Some examples are shown in Figure 5. In particular, the images exhibit small inter-class differences and large intra-class variance. Here we focus on classifying the segmented images available with the dataset. Owing to the imperfectness of the segmentation algorithm, the segmented foreground may contain part of the background or lose part of the object. Therefore, classifying such segmented images remains challenging. The dataset is divided into a training set, a validation set, and a testing set in the provided protocol. The training and validation sets consist of 10 images per class, and the testing set consists of the remaining 6,149 images (at least 20 per class).

2) PFID 61: The Pittsburgh Fast-Food Image Dataset is a collection of fast food images from 13 chain restaurants (e.g., McDonald's, Pizza Hut, KFC) acquired under lab and realistic settings [21]. It contains 61 categories of food items selected from 101 categories. There are 3 instances of each food item, each bought from a different branch and photographed on a different day, and 6 images from 6 viewpoints (60 degrees apart) for each food instance. Figure 6 shows 14 of the categories with two example images per category. It is notable that the appearance of different instances within each category varies greatly, and some different categories (e.g., hamburgers) are too similar to distinguish even by the human eye. Such large instance variance and tiny between-class difference make the classification quite challenging.

Fig. 6. Example images of the PFID 61 dataset, where each row of the left and right parts represents one category. Each category contains three instances and each instance has six images from different views. Two images of each instance are shown here. Image best viewed in color.

For Flower 102, most of the previous classification methods employing a single feature are based on the χ² kernel function of the clustered SIFTint and SIFTbdy features [27]. In stark contrast, we directly use the much simpler and more efficient linear SVM to classify the segmented images. We train the classifier directly on the training and validation images, as done by the baseline method provided in [20]; namely, 20 images per class are used for training and the remaining ones for testing. For PFID 61, we follow the experimental protocol proposed in previous work [21], [28] and use 3-fold cross-validation to evaluate performance. In each iteration, the 12 images of two instances are used for training and the 6 images of the third instance for testing. We repeat the training and testing process 3 times, with a different instance serving as the test set each time.

Table I gives the classification performance of different methods on the Flower 102 and PFID 61 datasets. Here KMTJSRC-CG is the method proposed by Yuan et al. [27], which uses multi-task joint sparse coding and achieves the state-of-the-art performance of 55.20% on this dataset. As for PFID 61, the state-of-the-art performance is 28.20%, achieved by Yang et al. [28] by utilizing the spatial relationships of local features. Besides these methods, we run the adopted coding methods LLC and LSA on both datasets to demonstrate the effectiveness of our proposed LDC in improving classification performance.

From Table I, it can be observed that the proposed method significantly outperforms LLC and LSA with SIFT features and generally achieves state-of-the-art performance. This verifies that the proposed distance pattern of local features is able to more effectively capture the discriminative information among multiple classes.

TABLE I
CLASSIFICATION ACCURACY (%) COMPARISON ON TWO OBJECT DATASETS Flower 102 AND PFID 61

Methods                      | Flower 102 | PFID 61
SVM (SIFTint) [20]^a         | 55.10      | -
KMTJSRC-CG (SIFTint) [27]    | 55.20      | -
Bag of SIFT [21]^b           | -          | 9.20
OM [28]^c                    | -          | 28.20
LLC-SIFT                     | 57.75      | 44.63 ± 4.00
LLC-Distance                 | 59.76      | 48.45 ± 3.58
LLC-Combine                  | 61.45      | 48.27 ± 3.59
LSA-SIFT                     | 57.80      | 43.35 ± 3.36
LSA-Distance                 | 58.78      | 46.90 ± 3.47
LSA-Combine                  | 60.38      | 46.54 ± 3.08

^a The best baseline accuracy provided by the authors of Flower 102 for a single feature, which is based on SVM.
^b One of the baseline accuracies on the 61 categories provided by the authors of PFID 61.
^c Orientation and Midpoint (OM), one of a set of methods based on the statistics of pairwise local features proposed by Yang et al., yields the best accuracy; the χ² kernel is adopted with SVM.

According to our analysis, the combination of the distance vectors and the original SIFT features should yield better classification accuracy than using either of them individually, because the combination compensates for the information loss and provides more useful information. This is well demonstrated on the Flower 102 dataset, where the combination achieves the best accuracy of 61.45%. However, the effectiveness of such combination does not hold on the PFID 61 dataset, where the individual distance vector achieves the best performance of 48.45%, rather than the combination. The reason is that different instances in PFID 61 exhibit very large variations, so the consistency of the local feature distributions between the training and testing images is not well guaranteed. This is experimentally indicated by the larger accuracy deviations of both the LLC and LSA methods in Table I. In this case, the combination may slightly overfit the training data and lead to a negligible decrease in classification accuracy, e.g., the average accuracy decreases from 48.45% to 48.27% when LLC-Distance is combined with LLC-SIFT.

C. Scene Datasets

We now evaluate the proposed method on the scene datasets Scene 15 and Indoor 67. Scene recognition is a challenging open problem in high-level vision, because each image contains not only indeterminate characterizing objects but also complex background [22]. Compared with object classification, the variations of images in scene classification are more severe, especially in lighting conditions, scale, and spatial layout.

1) Scene 15: This dataset consists of 15 scene categories, among which 8 were originally collected by Oliva et al. [29], 5 were added by Li et al. [5], and 2 were adopted from Lazebnik et al. [7]. Each class contains 200 to 400 images, and the average image size is around 300 × 250 pixels. Figure 7 shows some example images of each category.

Fig. 7. Example images of the Scene 15 dataset, containing all 15 categories with two images per category.

Fig. 8. Example images of the Indoor 67 dataset, containing 67 categories. All categories are organized into five big groups: Store, Home, Public spaces, Leisure, and Working. Four categories with two images per category are shown for each group. Due to the complex background, images within each category vary widely. Image best viewed in color.

2) Indoor 67: This dataset contains 67 indoor scene categories and a total of 15,620 images [22]. The images were collected from three different sources: online image search tools (Google and Altavista), online photo sharing sites (Flickr), and the LabelMe dataset. All images have a minimum resolution of 200 pixels along the smaller axis. The number of images varies across categories, but there are at least 100 images per category. To facilitate seeing the variety of scene categories, they are organized into 5 big scene groups (Store, Home, Public spaces, Leisure, and Working places), as shown in Figure 8.

For Scene 15, we follow the setting in [7] and randomly choose 100 images per class for training, testing on the rest. In particular, we repeat the evaluation three times and report the average results and the standard deviation. As for Indoor 67, we follow the settings of the baseline method provided in [22]: 80 images of each class are used for training and 20 images for testing, with the partition provided on the dataset website.

Table II provides the classification results on Scene 15 and Indoor 67, together with several baseline results on these two scene datasets. The baselines include detection based methods, linear coding methods, and the NBNN method.

TABLE II
CLASSIFICATION ACCURACY (%) COMPARISON ON TWO SCENE DATASETS Scene 15 AND Indoor 67

Methods                      | Scene 15     | Indoor 67
ROI + gist-annotation [22]^a | -            | 26.50
Object Bank [30]^b           | 80.90        | 37.60
KSPM [7]                     | 81.40 ± 0.50 | -
ScSPM [1]                    | 80.28 ± 0.93 | -
SC + linear kernel [31]^c    | 84.10 ± 0.50 | -
NBNN [13]^d                  | 77.00        | -
LLC-SIFT                     | 79.81 ± 0.35 | 43.78
LLC-Distance                 | 80.30 ± 0.62 | 43.53
LLC-Combine                  | 82.40 ± 0.35 | 46.28
LSA-SIFT                     | 80.12 ± 0.60 | 44.19
LSA-Distance                 | 79.73 ± 0.70 | 42.04
LSA-Combine                  | 82.50 ± 0.47 | 46.69

^a The baseline result provided by the authors of Indoor 67, where region-of-interest (ROI) detection is employed to reduce the interference of the cluttered background and an RBF-kernel SVM is adopted.
^b Object Bank pre-trains one object detector for each class.
^c For comparison, the result with basic features is shown here, but it adopts the intersection kernel rather than our adopted linear SVM.
^d The optimized version of NBNN, where the image-to-class distance is learned by employing Mahalanobis metrics.

For these two datasets, the distance vectors yield classification performance close to that of the original local features, due to the relatively poor consistency of the feature distribution between training and testing images. As expected, the combination achieves the best performance for both the LLC and LSA methods, as the spatial robustness of the transformed distance vectors strengthens the robustness of the final combined image-level representation.

D. General Object Datasets

Here we conduct experiments on the Caltech 101 and Caltech 256 datasets, in which each image contains a certain object against a cluttered background. The Caltech 101 dataset [23] contains 9,144 images in 101 object categories, including animals, vehicles, flowers, buildings, etc. The number of images per category varies from 31 to 800. The Caltech 256 dataset [24] contains 30,607 images from 256 object categories, and each category contains at least 80 images. Besides the object categories, each dataset includes an extra "background" class, i.e., BACKGROUND_Google and clutter, respectively. Figure 9 gives some example images. Compared with Caltech 101, Caltech 256 presents much greater variation in object size, location, pose, etc.

For both datasets, we randomly select 30 images per category for training and test on the rest. We repeat this three times and report the average classification accuracy and the corresponding standard deviation. Table III shows the resulting classification performance on these two datasets. Here we compare our method mainly with the linear coding methods and the NBNN method. Note that LLC in [2] adopted three-scale SIFT features, while our work uses only single-scale SIFT features; moreover, for Caltech 256, LLC [2] adopted a dictionary of 4096 visual words to further improve the performance, whereas our dictionary size is fixed at 2048.


Fig. 9. Example images of the Caltech 101 and Caltech 256 data sets, containing 102 and 257 categories, respectively. Besides the object categories, each data set contains one extra background category, namely BACKGROUND_Google for Caltech 101 and clutter for Caltech 256. All categories in the two datasets have large object variations with cluttered background. Compared with Caltech 101, Caltech 256 has a more irregular object layout, which may degrade classification performance due to imperfect matching in spatial pooling. Image best viewed in color.

TABLE III
CLASSIFICATION ACCURACY (%) COMPARISON ON Caltech 101 AND Caltech 256

Methods                     Caltech 101      Caltech 256
SVM-KNN [32]                66.20 ± 0.50     -
KSPM [7], [24]              64.60 ± 0.80     34.10
ScSPM [1]                   73.20 ± 0.54     34.02 ± 0.35
SC + linear kernel [31]^a   71.50 ± 1.10     -
LScSPM [16]                 -                35.74 ± 0.10
NBNN [2], [8]^b             70.40            37.00
LLC [2]^c                   73.44            41.19
LSA [10]                    74.21 ± 0.81     -
LLC-SIFT                    72.65 ± 0.33     36.27 ± 0.27
LLC-distance                73.34 ± 0.95     37.40 ± 0.07
LLC-combine                 74.59 ± 0.54     38.41 ± 0.11
LSA-SIFT                    72.86 ± 0.33     36.52 ± 0.26
LSA-distance                71.45 ± 0.87     36.30 ± 0.06
LSA-combine                 74.47 ± 0.46     38.25 ± 0.08

^a For fair comparison, the result with basic features and a linear kernel is shown here. Higher accuracy is also reported in [31], but with the intersection kernel.
^b Performance of the original NBNN [8] as provided in [2].
^c LLC adopts three-scale SIFT features and a global dictionary of size 4096, which yields higher accuracy than single-scale features, especially on Caltech 256 with its larger scale variation.

However, even when following the same setting on the Caltech 101 dataset, our own results are slightly worse than those reported in the previous literature; the same holds for LSA. Such decreases may stem from implementation details. For a fair comparison, we therefore only compare results produced by our own implementation.

Comparing the results in Table III, we observe that the combination of the distance vectors and the original features always yields better performance than either one individually, as expected. Compared with previous methods, our method achieves satisfying performance and outperforms similar methods that use a linear SVM and a single feature. The classification accuracy could be further increased if an advanced learning-based model [15] or a graph-matching kernel [33] were adopted, at the cost of their additional complexity.

From the above experimental results on several different types of image datasets, we can summarize the effectiveness of the proposed method as follows:

1) The distance vectors are quite discriminative under the mild condition that the distributions of the training data and the testing data are consistent to some extent, e.g., when the involved images suffer less interference from cluttered background.

2) The transformation to the distance vector relaxes the requirement for similarity of object spatial layout, owing to its independence of the spatial positions of distinctive objects. This is one of the critical differences from the original local features.

3) Under the coding-pooling framework, the distance vector and the original feature are complementary to each other. Consequently, their combination captures the useful classification information more comprehensively and generally achieves higher classification performance, an effect that is uniform across all the used datasets.

E. Discussion

We have proposed the linear distance coding method and verified its effectiveness on multiple types of benchmark datasets. Here we evaluate the influence of the number of nearest neighbors used for distance calculation and for coding separately. In particular, we select the datasets Flower 102, Indoor 67, and Caltech 101, one per type, to investigate the performance under different parameter values; LLC is employed throughout.

1) Neighbor Number k^d_nn for Calculating Distance: In Section III, we introduced the class manifolds to calculate the distance of a local feature to a certain class, with the aim of reducing the complexity and the interference of noisy features. To investigate how k^d_nn affects the final classification performance, we report the average classification accuracy under different values k^d_nn ∈ {1, 2, 3, 4}; the plot is shown in Figure 10.

From these results, we have the following observations. First, the combined representation is more robust to k^d_nn than the individual distance vector, since the combination also encapsulates the information from the original features, which is not affected by this parameter. Second, the influence of this parameter varies considerably across datasets, especially when only the distance vector is adopted. For example, the classification accuracy on Flower 102 keeps increasing as k^d_nn grows from 1 to 4; in fact, the performance fluctuates only slightly once the results for k^d_nn = 1 are discarded. Based on these observations across datasets, we suggest k^d_nn = 3 as a good trade-off.

2) Neighbor Number k^c_nn for Coding: Now we investigate the effect of k^c_nn on the final classification performance, where k^d_nn = 3 is universally used for calculating the distance vector.



Fig. 10. Classification accuracy of the proposed methods under different k^d_nn ∈ {1, 2, 3, 4}, where three types of data sets, Flower 102, Indoor 67, and Caltech 101, are adopted. Compared to the individual distance vector, the combination is more robust to the parameter k^d_nn, as it provides more complete information. Image best viewed in color.

Fig. 11. Classification accuracy curves of LLC (Original), LDC (Distance), and their combination (Combine) for different k^c_nn ∈ {2, 5, 10, 20}, where three types of data sets, Flower 102, Indoor 67, and Caltech 101, are adopted. The three methods show different trends as k^c_nn varies. In particular, the combination shows the slightest variation, i.e., it is considered insensitive to the parameter k^c_nn. Image best viewed in color.

Similarly, we show the classification performance under different values in Figure 11. In particular, the results of LLC on the SIFT features are provided in addition to those of the distance vector and the combination, where four values k^c_nn ∈ {2, 5, 10, 20} are explored, as suggested in [2]. For a fair comparison, all results here are produced by our own implementations.
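For reference, the coding step that k^c_nn controls can be sketched with the fast approximation of LLC [2]: each descriptor is coded over only its k^c_nn nearest visual words by solving a small constrained least-squares problem. This is a minimal illustration, not the authors' code, and the ridge term `lam` is an assumed stabilizer rather than a value from the paper.

```python
import numpy as np


def llc_code(x, codebook, k_cnn=10, lam=1e-4):
    """Locality-constrained linear coding for one descriptor.

    `x` is a (d,) descriptor and `codebook` an (M, d) dictionary.
    Following the fast approximation in [2], only the k_cnn nearest
    codewords receive nonzero codes.
    """
    d2 = np.sum((codebook - x) ** 2, axis=1)
    idx = np.argsort(d2)[:k_cnn]            # k_cnn nearest visual words
    B = codebook[idx] - x                   # shift neighbors to the origin
    C = B @ B.T                             # local covariance
    C += lam * np.trace(C) * np.eye(k_cnn)  # regularize for stability
    w = np.linalg.solve(C, np.ones(k_cnn))
    w /= w.sum()                            # enforce the sum-to-one constraint
    code = np.zeros(codebook.shape[0])
    code[idx] = w
    return code
```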

From Figure 11, the optimal parameter of each method heavily depends on the characteristics of the involved dataset, e.g., the variation of the images, the degree of background clutter, etc. We summarize the observations from Figure 11 for the different representations as follows.

1) SIFT: For the three selected datasets, the optimal parameter differs considerably, e.g., k^c_nn = 2 for Flower 102, while k^c_nn = 5 for Indoor 67 and Caltech 101. This may be caused by the dependence of the optimal parameter value on the interference of the cluttered background. In particular, the images in Flower 102 are all segmented, which significantly reduces the influence of the background, so a small neighborhood is sufficient.

2) Distance: The distance vector, introduced by our proposed transformation, possesses different semantics from the original local feature. Compared with SIFT, the performance of the distance vector is relatively stable across datasets; for example, the optimal accuracy is almost always achieved at k^c_nn = 10.

3) Combine: By taking advantage of both the stable "SIFT" and the discriminative "Distance", the combination is the most robust to the value of k^c_nn across all datasets. For example, it achieves almost the same accuracy on Flower 102 under the different values of k^c_nn.

From the above analysis, the parameter k^c_nn strongly influences performance when LLC is performed on the original SIFT features, but this dependence is relaxed for the transformed distance vector. In particular, k^c_nn = 10 is suggested for both the individual distance vector and the combination in this work.

VI. CONCLUSION

In this paper, we propose the linear distance coding method to capture the discriminative information of local features and to relieve the dependence of spatial pooling on the object-layout similarity of images. Consequently, the proposed method can effectively improve classification performance, which is well verified on various types of datasets. In essence, the distance vector extracts discriminative information based on the image-to-class distance, a motivation quite different from that of the traditional coding models. The analysis and experiments show that the distance vector and the original features are complementary to each other; thus the combination of the two image representations can generally yield higher classification performance.

By comparing the classification results of the proposed method on different types of benchmark datasets, we conclude that cluttered background significantly degrades the final classification performance because of its influence on the salient features of different classes. Inspired by this observation, we plan to design a new model to reduce the interference of background and thereby improve classification performance, e.g., by embedding segmentation results into the classification framework, which forms one of our future directions.

REFERENCES

[1] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1794–1801.

[2] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3360–3367.

[3] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.

[4] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1. Jun. 2005, pp. 886–893.

[5] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2. Jun. 2005, pp. 524–531.

[6] J. van Gemert, C. Veenman, A. Smeulders, and J. Geusebroek, “Visual word ambiguity,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7, pp. 1271–1283, Jul. 2010.

[7] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2. Jun. 2006, pp. 2169–2178.


[8] O. Boiman, E. Shechtman, and M. Irani, “In defense of nearest-neighbor based image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.

[9] J. van Gemert, J. Geusebroek, C. Veenman, and A. Smeulders, “Kernel codebooks for scene categorization,” in Proc. Eur. Conf. Comput. Vis., Oct. 2008, pp. 696–709.

[10] L. Liu, L. Wang, and X. Liu, “In defense of soft-assignment coding,” in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2486–2493.

[11] X. Zhou, K. Yu, T. Zhang, and T. Huang, “Image classification using super-vector coding of local image descriptors,” in Proc. Eur. Conf. Comput. Vis., vol. 5. Sep. 2010, pp. 141–154.

[12] R. Behmo, P. Marcombes, A. S. Dalalyan, and V. Prinet, “Toward optimal naive Bayes nearest neighbor,” in Proc. Eur. Conf. Comput. Vis., vol. 4. Sep. 2010, pp. 171–184.

[13] Z. Wang, Y. Hu, and L.-T. Chia, “Image-to-class distance metric learning for image classification,” in Proc. Eur. Conf. Comput. Vis., vol. 1. Sep. 2010, pp. 706–719.

[14] T. Tuytelaars, M. Fritz, K. Saenko, and T. Darrell, “The NBNN kernel,” in Proc. Int. Conf. Comput. Vis., vol. 1. Nov. 2011, pp. 1824–1831.

[15] J. Feng, B. Ni, Q. Tian, and S. Yan, “Geometric ℓp-norm feature pooling for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 2609–2704.

[16] S. Gao, I. Tsang, L. Chia, and P. Zhao, “Local features are not lonely - Laplacian sparse coding for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., San Francisco, CA, Jun. 2010, pp. 3555–3561.

[17] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in Proc. Int. Joint Conf. Comput. Vis. Theory Appl., vol. 1. Lisboa, Portugal, Feb. 2009, pp. 331–340.

[18] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.

[19] K. Yu and T. Zhang, “Improved local coordinate coding using local tangents,” in Proc. Int. Conf. Mach. Learn., Jun. 2010, pp. 1215–1222.

[20] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Proc. Indian Conf. Comput. Vis., Graph. Image Process., Dec. 2008, pp. 722–729.

[21] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, “PFID: Pittsburgh fast-food image dataset,” in Proc. Int. Conf. Image Process., Nov. 2009, pp. 289–292.

[22] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 413–420.

[23] F.-F. Li, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” Comput. Vis. Image Understand., vol. 106, no. 1, pp. 59–70, 2007.

[24] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” Dept. Comput. Sci., California Inst. Technology, Tech. Rep. 7694, Apr. 2007.

[25] A. Vedaldi and B. Fulkerson. (2008). VLFeat: An Open and Portable Library of Computer Vision Algorithms [Online]. Available: http://www.vlfeat.org/

[26] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, “LIBLINEAR: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, May 2008.

[27] X. Yuan and S. Yan, “Visual classification with multi-task joint sparse representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3493–3500.

[28] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, “Food recognition using statistics of pairwise local features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2249–2256.

[29] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001.

[30] L.-J. Li, H. Su, E. P. Xing, and F.-F. Li, “Object bank: A high-level image representation for scene classification & semantic feature sparsification,” in Proc. Adv. Neural Inf. Process. Syst., Dec. 2010, pp. 1378–1386.

[31] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce, “Learning mid-level features for recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2559–2566.

[32] H. Zhang, A. C. Berg, M. Maire, and J. Malik, “SVM-KNN: Discriminative nearest neighbor classification for visual category recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2. Jun. 2006, pp. 2126–2136.

[33] O. Duchenne, A. Joulin, and J. Ponce, “A graph-matching kernel for object categorization,” in Proc. Int. Conf. Comput. Vis., vol. 5. Barcelona, Spain, Nov. 2011, pp. 1792–1799.

Zilei Wang received the B.S. and Ph.D. degrees in control theory and control engineering from the University of Science and Technology of China (USTC), Hefei, China, in 2002 and 2007, respectively.

He is currently an Associate Professor with the Department of Automation, USTC, and is also with the Vision and Machine Learning Laboratory, National University of Singapore, Singapore, as a Research Fellow. His current research interests include computer vision and media streaming techniques.

Jiashi Feng received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 2007. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore.

His current research interests include computer vision and machine learning.

Shuicheng Yan (M’06–SM’09) is currently an Assistant Professor with the Department of Electrical and Computer Engineering, National University of Singapore, where he is the Founding Lead of the Learning and Vision Research Group (http://www.lv-nus.org). His current research interests include computer vision, multimedia, and machine learning. He has authored or co-authored over 200 technical papers.

He was a recipient of the Best Paper Award from ICIMCS in 2009, ACM MM in 2010, and ICME in 2010, the Winner Prize of the Classification Task in PASCAL VOC in 2010, the Honorable Mention Prize of the Detection Task in PASCAL VOC in 2010, and the TCSVT Best Associate Editor (BAE) Award in 2010, and he is a co-author of the Best Student Paper Awards of PREMIA in 2009 and PREMIA in 2011. He is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and a Guest Editor of special issues for TMM and CVIU.

Hongsheng Xi received the B.S. and M.S. degrees in applied mathematics from the University of Science and Technology of China (USTC), Hefei, China, in 1980 and 1985, respectively.

He is currently a Professor with the Department of Automation, USTC, where he also directs the Laboratory of Network Communication Systems and Control. His current research interests include stochastic control systems, network performance analysis and optimization, wireless communications, and signal processing.