Visual Coding in a Semantic Hierarchy

Yang Yang†, Hanwang Zhang‡, Mingxing Zhang†, Fumin Shen†, Xuelong Li§

†University of Electronic Science and Technology of China
‡National University of Singapore
§Chinese Academy of Sciences

{dlyyang, hanwangzhang, zhangmingxing, fumin.shen}@gmail.com; [email protected]

ABSTRACT

In recent years, tremendous research endeavours have been dedicated to seeking effective visual representations for facilitating various multimedia applications, such as visual annotation and retrieval. Nonetheless, existing approaches can hardly achieve satisfactory performance because the semantic properties of visual codes are rarely fully explored. In this paper, we present a novel visual coding approach, termed hierarchical semantic visual coding (HSVC), to effectively encode visual objects (e.g., images and videos) in a semantic hierarchy. Specifically, we first construct a semantic-enriched dictionary hierarchy, which is comprised of dictionaries corresponding to all concepts in a semantic hierarchy as well as their hierarchical semantic correlations. Moreover, we devise an on-line semantic coding model, which simultaneously 1) exploits the rich hierarchical semantic prior knowledge in the learned dictionary, 2) reflects the semantic sparsity of visual codes, and 3) explores semantic relationships among concepts in the semantic hierarchy. To this end, we propose to integrate a concept-level group sparsity constraint and a semantic correlation matrix into a unified regularization term. We design an effective algorithm to optimize the proposed model, and a rigorous mathematical analysis guarantees that the algorithm converges to a global optimum. Extensive experiments on various multimedia datasets illustrate the superiority of our proposed approach over state-of-the-art methods.

Categories and Subject Descriptors

H.3.1 [Content Analysis and Indexing]: Abstracting methods

General Terms

Algorithms

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

MM'15, October 26–30, 2015, Brisbane, Australia.
© 2015 ACM. ISBN 978-1-4503-3459-4/15/10 ... $15.00.
DOI: http://dx.doi.org/10.1145/2733373.2806244.

Keywords

semantic hierarchy; dictionary hierarchy; visual coding; group lasso

1. INTRODUCTION

Recent years have witnessed the power of multimedia techniques in facilitating various real-world applications, such as visual retrieval [32] and visual annotation [29, 30]. Over the past two decades, visual representations (or features) have undergone a transition from simple raw feature extraction, to higher-level feature statistics such as Bag-of-Words, and today, to more complex but generic feature extraction pipelines. For example, one of the most popular pipelines extracts hand-crafted descriptors (e.g., SIFT or HOG) from local visual patches, and then encodes these descriptors into a certain over-complete representation using algorithms such as k-means or sparse coding; after the coding stage, global visual representations are formed by spatially pooling the coded local descriptors. Finally, the global representations of visual samples are fed into task-dependent classifiers. Another pipeline has been growing in popularity with the recent advances in deep learning. In general, it is a deep neural network with repeated feature detection and pooling layers. The output of the top layer is regarded as the ultimate visual feature, which is then fed into the final classification layer. Although these two pipelines are motivated differently, both have been shown to achieve state-of-the-art performance on most image classification benchmarks.

However, compared to humans, who can easily recognize tens of thousands of objects, the above artificial pipelines are clearly inferior. Deng et al. [5] have shown that when they tried to classify ten thousand object categories, the accuracy was only around 3.7%. Admittedly, it is still unclear how to computationally model the human vision system; nevertheless, we can observe the following three crucial defects in current visual representation designs.

• First, one of the core stages of these pipelines is the generation of elementary feature representations (e.g., a codebook, dictionary, or convolutional kernels), which are learned or sampled globally over the whole dataset. However, this may lose fine-grained details that are specific to certain object categories due to sample bias. For example, if the number of crucial elementary features of "dog" is smaller than that of other objects, then the resultant codebook is less capable of representing "dog" images.



Figure 1: The proposed hierarchical semantic visual coding (HSVC) pipeline. In the off-line part, we learn a dictionary for every concept in a semantic hierarchy; each dictionary is learned from the training samples belonging to the referenced concept. In the on-line part, we aggregate the set of learned dictionaries into a holistic semantic-enriched dictionary hierarchy, which provides the fundamental feature basis for HSVC. Meanwhile, we also explore the semantic properties of visual codes, including semantic sparsity and semantic correlation. The resultant sparse visual codes serve as the visual representation in subsequent tasks such as visual annotation and visual retrieval.

• Second, these pipelines neglect the relationships between different object categories, which is one of the most appealing abilities of the human vision system. For example, when a human tries to recognize a "puma", she may transfer some visual knowledge from "pet cat". In this way, our vision system can generalize well to objects unseen in the training samples.

• Moreover, real-world visual samples usually contain multiple objects and scenes, which are correlated through certain semantics. For example, "people" and "dog" are more likely to appear in the same scene than "dog" and "fish", since the former two have a closer semantic relationship than the latter. However, to the best of our knowledge, few existing visual representation methods take this observation into account.

In this paper, we propose a novel semantic visual representation extraction pipeline, termed hierarchical semantic visual coding (HSVC), which encodes visual objects in accordance with a semantic hierarchy. As illustrated in Figure 1, the proposed pipeline comprises an off-line part and an on-line part. Suppose we have a semantic hierarchy populated with training samples referenced to each concept (e.g., ImageNet [6]). In the off-line part, we embed rich hierarchical semantic information into the dictionary learning process to construct a semantic-enriched dictionary hierarchy. For each individual concept in the semantic hierarchy, we extract visual features (e.g., DeCAF [8], Bag-of-Words, etc.) for all the attached training samples and learn a concept-specific dictionary; the dictionaries corresponding to all concepts in the hierarchy are then concatenated to form a holistic dictionary. Moreover, in order to facilitate the subsequent on-line visual coding, we also precompute the semantic similarities (or correlations) of concepts in the hierarchy. In the on-line part, we devise a novel semantic coding approach that fully explores the semantic properties of visual codes, including the rich hierarchical semantic information in the learned dictionary, the semantic sparsity of concepts, and the semantic relationships among concepts in the semantic hierarchy. To achieve this goal, we incorporate a novel semantically structured regularization term, which not only induces concept-level group sparsity to preserve the semantic purity of visual samples, but also employs hierarchical semantic cues to retain semantic correlations among concepts and enhance the representative power of visual codes. The generated sparse codes are taken as the final visual representation and used in specific tasks. We conduct extensive experiments on real-world large-scale datasets, and the promising results demonstrate that the proposed visual codes are more effective in supporting fundamental applications, such as visual annotation and retrieval, than state-of-the-art methods.

We highlight the three major contributions of the proposed coding pipeline as follows:

• Dictionary Learning in a Semantic Hierarchy. In order to prepare dictionaries for generating effective visual representations, we propose a novel dictionary learning approach in a semantic hierarchy. We learn a concept-specific dictionary for every concept using all its attached visual samples. The constructed dictionary comprises the elementary features that are most representative of the referenced concept. By aggregating the dictionaries of all concepts in the semantic hierarchy, we obtain a holistic semantic-enriched dictionary hierarchy.

• Concept-specific Group Sparsity. We observe that real-world visual objects normally cover very few semantic concepts. For instance, an image of "car" may only correspond to a small number of concepts such as "car", "road", and "building". Hence, encoding information of all the concepts into an individual visual sample may compromise the semantic purity of visual samples and lead to imprecise representations. To avoid this problem, we propose to impose concept-level sparsity on the desired codes. Each group is the part of the dictionary (i.e., a group of base elements) that corresponds to a specific semantic concept, and the induced sparsity leads to sparse codes covering a reasonable number of concepts.

• Hierarchical Semantic Correlation. In order to make the sparse codes more meaningful and representative, we further exploit the semantic relationships


among concepts when coding visual objects. By exploiting the explicit semantic information of concepts in the hierarchy, we are able to calculate semantic similarities (or correlations) and incorporate them into the coding process. For instance, we may penalize the co-occurrence of the codes of "people" and "dog" less than that of "fish" and "dog". The resultant visual codes are therefore more semantically informative.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes dictionary learning in a semantic hierarchy. Section 4 details the proposed hierarchical semantic visual coding pipeline, including the formulation and algorithms, and Section 5 provides a rigorous algorithmic analysis. Experimental results and analysis are reported in Section 6, followed by conclusions in Section 7.

2. RELATED WORK

2.1 Visual Coding for Representation

State-of-the-art visual representations comprise the following stages. First, raw visual feature descriptors (e.g., color, texture, or shape) are extracted from local patches; then, these descriptors are encoded into an over-complete representation. One general coding approach is:

\min_{c} \|x - Dc\|_2^2, \quad \text{s.t.} \; c \in \mathcal{F},    (1)

where x \in R^d is the feature vector of a visual sample, d is the dimensionality of the feature space, D \in R^{d \times M} is an over-complete dictionary (i.e., M \ge d), M is the number of bases in the dictionary, c \in R^M is the sparse code, and \mathcal{F} is a certain feasible region. For example, when \mathcal{F} = \{c \,|\, c \in \{0,1\}^M, \|c\|_1 = 1\}, the coding reduces to the well-known k-means method [9]; when \mathcal{F} = \{c \,|\, \|c\|_1 \le \gamma\}, where \gamma > 0 is a constant, it becomes the popular sparse coding [17]. After the coding stage, global image representations are formed by spatially pooling the coded local descriptors [16]. Finally, such global representations are used in specific tasks such as classification and retrieval. Methods following this pipeline have achieved state-of-the-art performance on several benchmark datasets. Recently, the convolutional neural network (CNN) has been shown to be a successful alternative for learning image representations, achieving a significant boost in image classification [8, 15, 24]. These two pipelines both follow a hierarchical feature extraction philosophy, but the core difference is that a CNN automatically learns features while the other pipeline is manually designed. However, they do not conflict, since the learned image representation can be used as an input to the manually designed pipeline to further boost performance [8].
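To make the role of the feasible region \mathcal{F} in Eq. (1) concrete, the following Python sketch (our illustration, not code from the paper) instantiates its two special cases: the hard-assignment k-means coder and the ℓ1-regularized sparse coder. The Lasso solver stands in for the ℓ1-constrained problem, so gamma acts as a regularization weight rather than the exact constraint radius.

import numpy as np
from sklearn.linear_model import Lasso

def encode_kmeans(x, D):
    # F = {c : c in {0,1}^M, ||c||_1 = 1}: pick the single best-fitting basis.
    errs = np.linalg.norm(x[:, None] - D, axis=0)
    c = np.zeros(D.shape[1])
    c[np.argmin(errs)] = 1.0
    return c

def encode_sparse(x, D, gamma=0.1):
    # F = {c : ||c||_1 <= gamma}, solved in its Lagrangian (Lasso) form.
    model = Lasso(alpha=gamma, fit_intercept=False)
    model.fit(D, x)  # columns of D act as features, dimensions of x as samples
    return model.coef_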

As compared to local feature-level coding methods for image representation, our work falls into image-level coding, where the feature x is a global representation of an image (e.g., we use the DeCAF feature in this paper). In this way, we relate to some recent work. Zheng et al. [34] exploited the geometric structure of the data space, casting the image-level sparse coding problem into a graph embedding problem. Huang et al. [13] proposed an efficient sparse coding method for image annotation, where the dictionary is the feature matrix of the whole set of training images. Some studies used coding methods with group sparsity, which tackle the problem that ℓ1-norm sparse coding implicitly assumes each element in the dictionary is independent of all others. Essentially, group sparsity integrates correlations among the dictionary elements. Zhang et al. [33] used group sparsity to select useful features of the same type for image annotation. Bengio et al. [1] proposed group sparse coding for jointly encoding the group of visual descriptors within the same image to achieve image-level sparsity. Yang et al. [28] considered regions of an image as a group to achieve region-level image annotation. Gao et al. [11] extended the meaning of a group to the "class", "tag" and "instance" levels and proposed a multi-layer group sparse coding algorithm for image annotation. Formally, given a dictionary D = \{D_1, D_2, \ldots, D_G\}, where D_i is a group (or subset) of the entire dictionary D, sparse coding with group sparsity (a.k.a. Group Lasso [31]) can be formulated as

\min_{c} \left\| x - \sum_{i=1}^{G} D_i c_i \right\|_2^2 + \lambda \sum_{i=1}^{G} \|c_i\|_2,    (2)

where \lambda is a trade-off constant and c_i is the coding coefficient vector corresponding to the i-th group. In fact, the group sparsity constraint \sum_{i=1}^{G} \|c_i\|_2 can be considered as the ℓ1-norm of the vector [\|c_1\|_2, \|c_2\|_2, \ldots, \|c_G\|_2]^T. In this paper, we consider a group as a concept-specific dictionary.
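The interpretation of the Group Lasso penalty as an ℓ1-norm over group norms is easy to verify numerically; the snippet below (a minimal sketch of our own) does exactly that for a code vector split into equal-sized groups.

import numpy as np

def group_lasso_penalty(c, group_size):
    # Sum of l2 norms of the groups of c, i.e. the Group Lasso regularizer.
    group_norms = np.linalg.norm(c.reshape(-1, group_size), axis=1)
    # Identical to the l1-norm of the vector of group norms:
    assert np.isclose(group_norms.sum(), np.linalg.norm(group_norms, 1))
    return group_norms.sum()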

2.2 Semantic Hierarchy

A semantic hierarchy is a formally defined taxonomy or ontology in which each node represents a semantic concept; examples include WordNet [10], ImageNet [6], and LSCOM [20]. It organizes semantic concepts from general to specific and has been shown to be effective in boosting visual recognition [18] and retrieval [4, 7].

In this paper, we predict the semantic path of images using hierarchical concept classifiers. Compared to a bag of flat classifiers (e.g., a bag of one-vs-all classifiers), hierarchical classifiers exploit the ontological relationships and thus significantly reduce the prediction time while producing comparably good accuracy. For example, [12] automatically built classification trees which achieve a significant speedup at a small performance drop. Other studies found that hierarchical top-down prediction using local classifiers can retain the same performance as one-vs-all classifiers [18]. Here, "local" refers to a concept classifier training strategy that considers the images of the concept itself as positives and those of its siblings as negatives. However, local classifiers may suffer severely from error propagation: if a classifier at a higher level fails, it is hard for the successive classifiers at lower levels to stop the error (e.g., to classify the erroneous input as negative), since they have never seen such an error as a negative sample. To alleviate this problem, [3] developed a structure learning method that optimizes the performance of all local classifiers in a semantic hierarchy. However, the training cost of that method is prohibitive for the large-scale semantic hierarchy employed in this work. For efficiency, we adopt a hierarchical one-vs-all training strategy to learn local concept classifiers as in [23].


Most recent work exploits a semantic hierarchy for image retrieval by designing a semantic similarity function that embeds hierarchical information. In [7], given two images, their visual nearest-neighbor images were first found, and their semantic distance was computed as a distance between the concepts of their neighbors. [4] developed a hierarchical bilinear similarity function for image retrieval: an image is first represented as a semantic vector z consisting of its relevances to a set of concepts, and the bilinear similarity between any two images is defined as z_i^T S z_j, where S is a matrix encoding the pairwise semantic affinities among the concepts. This method achieves state-of-the-art image retrieval performance on ImageNet. Recently, [27] proposed to associate a separate visual similarity metric with every concept in a hierarchy and then learn the metrics jointly through an aggregated hierarchical metric. In our work, we use the semantic similarities between the concepts in a semantic hierarchy to regularize the semantic coding. By doing so, we not only achieve semantic group sparsity, which boosts the discriminative power of the sparse codes, but also enrich their semantic descriptiveness.

3. DICTIONARY LEARNING IN A SEMANTIC HIERARCHY

Dictionary learning [22, 25] plays a vital role in multimedia applications. The premise of our visual coding pipeline is to construct a semantic-enriched dictionary hierarchy in accordance with a given hierarchy. As shown in Figure 1, for each concept in the hierarchy, we learn a dictionary from the visual samples that belong to the concept. Without loss of generality, for any concept c, we learn a dictionary D_c from the training samples as follows:

\min_{C, D_c} \| X - D_c C \|_F^2,    (3)

where X = [x_1, \ldots, x_{N_c}] \in R^{d \times N_c} is the matrix of the training samples, N_c is the number of training samples, x_i \in R^d is the i-th sample, C is the matrix of the corresponding codes for all training samples, and \| \cdot \|_F is the Frobenius norm of a matrix. Fortunately, the solution we are interested in, the dictionary D_c, can be obtained through the Singular Value Decomposition (SVD) of the visual sample matrix, X = D_c \Sigma C^T, where \Sigma is the diagonal matrix of singular values, and D_c and C are the matrices of left and right singular vectors, respectively. The ultimate dictionary with respect to the entire semantic hierarchy is then obtained by concatenating all the concept-specific dictionaries:

D = [D_1, \ldots, D_c, \ldots, D_G],    (4)

where G is the number of semantic concepts in the hierarchy.

Dictionary D_c can be considered as a base matrix that spans an intrinsic subspace for the visual information of concept c. The advantages of our concept-specific dictionary learning are twofold. First, since we will eventually combine these dictionaries, the size of each dictionary should be small; otherwise it would cause prohibitive computational problems. Fortunately, the SVD solution D_c leads to a compact base matrix depicting the characteristics of a specific concept. In this work, we set the number of base features to 10, which means only the top 10 directions of visual variance are captured. Note that each dictionary is only responsible for its own concept; therefore we merely need to retain the principal variances in each dictionary, since the "entire" picture will eventually emerge by aggregating all the "local" pictures.


Figure 3: Illustration of semantic sparsity and semantic correlation among concepts. Due to its limited amount of content, an individual image can only cover a small proportion of the semantic concepts in the whole semantic space, which implies that within the image, the semantic information should be sparse at the concept level. On the other hand, concepts co-occurring in the same image (e.g., "sky" and "building") should be semantically correlated, and thus should mutually reinforce the weights of each other's codes.

Figure 2 illustrates the semantic meanings of the dictionaries we learned. Second, computing D_c is very efficient, even when the number of training samples is large. This is common when learning dictionaries for more general concepts, which possess many more samples. In fact, no matter how large the number of samples (i.e., the column dimension of X) is, we can always efficiently obtain D_c by computing the singular vectors of X^T, whose column dimension is the (smaller) dimension of the original visual feature.
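As an illustration of this construction (a sketch with our own function names, not released code from the paper), each concept dictionary can be obtained from a truncated SVD of the concept's sample matrix, and the results concatenated as in Eq. (4):

import numpy as np

def learn_concept_dictionary(X, n_bases=10):
    # Solve Eq. (3) for one concept: D_c = top left singular vectors of X,
    # where X is the (d, N_c) matrix of the concept's training features.
    U, _, _ = np.linalg.svd(X, full_matrices=False)  # ordered by singular value
    return U[:, :n_bases]

def build_hierarchy_dictionary(concept_samples):
    # Eq. (4): concatenate concept dictionaries into D = [D_1, ..., D_G].
    return np.hstack([learn_concept_dictionary(X) for X in concept_samples])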

4. CODING VISUAL OBJECTS IN A SEMANTIC HIERARCHY

Having harvested the concept-specific dictionaries as described above, we are ready to encode visual objects into sparse codes while exploring the semantic properties of the hierarchy. In this way, we expect the sparse codes not only to provide efficient sparse visual representations but also to admit explicit semantic interpretations, such as objects and the relationships among them. As noted in Eq. (2), a straightforward way is to formulate the coding problem as a Group Lasso problem. However, this formulation treats the codes of each semantic concept independently and thus neglects the hierarchical semantic relationships between them. In fact, concepts are semantically inter-correlated in the semantic space. Furthermore, for an individual visual object, its related semantics should be sparse due to its limited amount of content. In this section, we show how to efficiently and effectively encode visual objects in a semantic hierarchy by jointly considering semantic sparsity and semantic correlation.

Figure 2: Illustration of the concept-specific dictionary. By Eq. (3), each learned dictionary captures only the intrinsic semantics of its concept. For the concepts "cock" (ImageNet ID: n01514668), "chickadee" (ImageNet ID: n01592084), and "bird" (ImageNet ID: n01503061), we choose the dimension of the dictionary with the largest singular value and map all the images of the concept onto this dimension. The top and bottom images have the largest and smallest mapped values, respectively, representing semantic meanings. For example, the dimension in the "bird" dictionary captures birds with/without wings.

4.1 Formulation

Inspecting Eq. (2), we can easily see that the Group Lasso term \sum_{i=1}^{G} \|c_i\|_2 fails to incorporate the relationship between c_i and c_j. For example, if the semantic similarity s_{dog,people} of "dog" and "people" is large, we should consider them jointly and expect them to co-occur in the sparse codes with a relatively high probability; if s_{dog,fish} of "dog" and "fish" is small, it is reasonable to neglect their joint sparsity since they are semantically far apart. In general, we propose to jointly consider the sparsity of the codes through a similarity weight s_{ij} between concepts c_i and c_j:

\min_{c} \left\| x - \sum_{i=1}^{G} D_i c_i \right\|_2^2 + \lambda \sum_{i=1}^{G} \sum_{j=1}^{G} \left( c_i^T S_{ij} c_j \right)^{1/2},    (5)

where S_{ij} is a diagonal matrix whose diagonal entries are s_{ij}. In particular, the original Group Lasso in Eq. (2) is the special case with s_{ij} = 1 if i = j and s_{ij} = 0 otherwise. Specifically, s_{ij} can be defined as s_{ij} = f(\pi(i, j)), where \pi(i, j) is the lowest common ancestor of concepts c_i and c_j, and f(\cdot) is a real function that is non-decreasing going down the hierarchy, i.e., the lower the lowest shared ancestor is, the more similar concepts i and j are. For example, "dog" is much more similar to "people" than to "car", because "dog" shares the lower-level common ancestor "mammal" with "people", but only the ancestor "object" with "car". Figure 3 illustrates semantic sparsity and semantic correlation. Note that there are other ways to obtain s_{ij}, e.g., mining an Internet corpus or exploiting crowdsourcing [21].
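One simple way to realize s_{ij} = f(\pi(i, j)) is to let f grow with the depth of the lowest common ancestor. The sketch below (our own toy construction, with a hypothetical parent map and an arbitrary choice of f, not the paper's exact definition) makes this concrete.

def ancestors(node, parent):
    # Path from node up to the root, given a child -> parent map.
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lca_similarity(i, j, parent, f=lambda depth: depth / 10.0):
    # s_ij = f(depth of the lowest common ancestor pi(i, j)).
    anc_j = set(ancestors(j, parent))
    lca = next(a for a in ancestors(i, parent) if a in anc_j)
    depth = len(ancestors(lca, parent)) - 1   # root has depth 0
    return f(depth)

# Toy hierarchy: object -> {mammal -> {dog, people}, car}
parent = {"mammal": "object", "car": "object",
          "dog": "mammal", "people": "mammal"}
assert lca_similarity("dog", "people", parent) > lca_similarity("dog", "car", parent)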

It has been shown in [19] that the generalized form of the Group Lasso in Eq. (5) is a regularizer intermediate between the ℓ1-norm of the Lasso and the ℓ2-norm of ridge regression. This indicates that our visual coding formulation encourages concept-level sparsity while taking hierarchical semantic prior knowledge into consideration.

4.2 Solution

Note that our coding formulation in Eq. (5) is non-smooth because of the square root that imposes concept-level sparsity. In this part, we propose an efficient iterative algorithm to solve the optimization problem in Eq. (5).

At each iteration, we reformulate Eq. (5) into an alternative quadratic and smooth form by dividing the Group Lasso term by its value at the previous iteration:

\min_{c^{(t)}} \left\| x - \sum_{i=1}^{G} D_i c_i^{(t)} \right\|_2^2 + \lambda \sum_{i=1}^{G} \sum_{j=1}^{G} \frac{ c_i^{(t)T} S_{ij} c_j^{(t)} }{ 2 \left( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \right)^{1/2} }.    (6)

The re-weighting term \left( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \right)^{1/2} can be treated as a constant, since \{c_i\}_{i=1}^{G} has already been obtained at the previous iteration. Eq. (6) can be rewritten in a more compact form as

\min_{c^{(t)}} \left\| x - D c^{(t)} \right\|_2^2 + \lambda \, c^{(t)T} S^{(t-1)} c^{(t)},    (7)

where D = [D_1, \ldots, D_G], c = [c_1^T, \ldots, c_G^T]^T, and

S^{(t-1)} = \begin{pmatrix} S_{11}^{(t-1)} & \cdots & S_{1G}^{(t-1)} \\ \vdots & \ddots & \vdots \\ S_{G1}^{(t-1)} & \cdots & S_{GG}^{(t-1)} \end{pmatrix}, \quad S_{ij}^{(t-1)} = \frac{ S_{ij} }{ 2 \left( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \right)^{1/2} }.    (8)

It is easy to see that Eq. (7) has a closed-form solution:

c^{(t)} = \left( D^T D + \lambda S^{(t-1)} \right)^{-1} D^T x.    (9)

This equation is the iterative updating rule producing the ultimate hierarchical semantic sparse codes c*. The full procedure is given in Algorithm 1.

In the next section, we present a detailed analysis of the proposed algorithm and provide a rigorous mathematical verification of its convergence.

Algorithm 1: Hierarchical Semantic Visual Coding

Input: A visual feature x, dictionary matrix D and similarity matrix S w.r.t. a semantic hierarchy, trade-off parameter λ;
Output: Hierarchical semantic sparse codes c*;
1 Initialization: Randomly initialize sparse codes c^(0) with ||c^(0)||_2 = 1, t = 0;
2 repeat
3     t ← t + 1;
4     Calculate S^(t−1) according to Eq. (8);
5     c^(t) = (D^T D + λ S^(t−1))^(−1) D^T x;
6 until there is no change to c^(t);
7 Return: c* = c^(t);
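For concreteness, a direct NumPy rendering of Algorithm 1 could look as follows. This is our own sketch, assuming m bases per concept and S_ij = s_ij I as in the text, with a small ε guarding the square root in Eq. (8); it is not the authors' released code.

import numpy as np

def hsvc_encode(x, D, s, m, lam=1.0, max_iter=100, tol=1e-8, eps=1e-12):
    # x: (d,) feature; D: (d, G*m) dictionary; s: (G, G) similarities s_ij.
    M = D.shape[1]
    G = M // m
    DtD, Dtx = D.T @ D, D.T @ x
    c = np.random.randn(M)
    c /= np.linalg.norm(c)                      # ||c^(0)||_2 = 1
    for _ in range(max_iter):
        C = c.reshape(G, m)
        S_big = np.zeros((M, M))
        for i in range(G):                      # build S^(t-1) per Eq. (8)
            for j in range(G):
                if s[i, j] == 0.0:
                    continue
                w = s[i, j] * (C[i] @ C[j])     # c_i^T S_ij c_j with S_ij = s_ij I
                scale = s[i, j] / (2.0 * np.sqrt(max(w, eps)))
                S_big[i*m:(i+1)*m, j*m:(j+1)*m] = scale * np.eye(m)
        c_new = np.linalg.solve(DtD + lam * S_big, Dtx)   # Eq. (9)
        if np.linalg.norm(c_new - c) < tol:     # "no change to c^(t)"
            return c_new
        c = c_new
    return c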

5. ALGORITHMIC ANALYSIS

In this section, we prove that Algorithm 1 converges to a global minimum of Eq. (5) and analyze its computational complexity. We first show that Algorithm 1 converges to a limiting point and then show that this point is a stationary point of Eq. (5), which is therefore a global minimum due to the convexity of Eq. (5). We start from the following lemma.

Lemma 1. For nonzero vectors c and b, and a symmetric positive definite matrix S, we have

\left( c^T S c \right)^{1/2} - \frac{ c^T S c }{ 2 \left( b^T S b \right)^{1/2} } \le \left( b^T S b \right)^{1/2} - \frac{ b^T S b }{ 2 \left( b^T S b \right)^{1/2} }.    (10)

Proof. Since S is symmetric positive definite, we have S = L^T L by the Cholesky decomposition, and for any nonzero vector we have c^T S c > 0. Denote \bar{b} = Lb and \bar{c} = Lc. Since (\|\bar{c}\|_2 - \|\bar{b}\|_2)^2 \ge 0, expanding gives \|\bar{c}\|_2^2 - 2\|\bar{c}\|_2 \|\bar{b}\|_2 + \|\bar{b}\|_2^2 \ge 0; dividing by 2\|\bar{b}\|_2 and rearranging yields

\|\bar{c}\|_2 - \frac{ \|\bar{c}\|_2^2 }{ 2 \|\bar{b}\|_2 } \le \|\bar{b}\|_2 - \frac{ \|\bar{b}\|_2^2 }{ 2 \|\bar{b}\|_2 },    (11)

which, since \|\bar{c}\|_2^2 = c^T S c and \|\bar{b}\|_2^2 = b^T S b, proves Lemma 1.

Theorem 1 (Convergence of Algorithm 1). The iterative update in Algorithm 1 guarantees that the objective function in Eq. (5) converges to a global minimum.

Proof. Since the solution in Eq. (9) minimizes the objective function in Eq. (7), we have

\left\| x - \sum_{i=1}^{G} D_i c_i^{(t)} \right\|_2^2 + \lambda \sum_{i,j} \frac{ c_i^{(t)T} S_{ij} c_j^{(t)} }{ 2 \left( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \right)^{1/2} } \le \left\| x - \sum_{i=1}^{G} D_i c_i^{(t-1)} \right\|_2^2 + \lambda \sum_{i,j} \frac{ c_i^{(t-1)T} S_{ij} c_j^{(t-1)} }{ 2 \left( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \right)^{1/2} }.    (12)

By applying Lemma 1, we have

\left( c_i^{(t)T} S_{ij} c_j^{(t)} \right)^{1/2} - \frac{ c_i^{(t)T} S_{ij} c_j^{(t)} }{ 2 \left( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \right)^{1/2} } \le \left( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \right)^{1/2} - \frac{ c_i^{(t-1)T} S_{ij} c_j^{(t-1)} }{ 2 \left( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \right)^{1/2} }.    (13)

By summing Ineq. (13) over i and j, multiplying by \lambda, and adding it to Ineq. (12), we have

\left\| x - \sum_{i=1}^{G} D_i c_i^{(t)} \right\|_2^2 + \lambda \sum_{i,j=1}^{G} \left( c_i^{(t)T} S_{ij} c_j^{(t)} \right)^{1/2} \le \left\| x - \sum_{i=1}^{G} D_i c_i^{(t-1)} \right\|_2^2 + \lambda \sum_{i,j=1}^{G} \left( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \right)^{1/2}.    (14)

Ineq. (14) shows that the updating rule in Eq. (9) decreases the value of the objective function in Eq. (5), i.e., F(c^{(t)}) \le F(c^{(t-1)}). Since F(\cdot) is bounded below by zero, we conclude that F(c^{(t)}) converges to a limiting value F^* as t \to \infty. Moreover, since F(\cdot) is Lipschitz continuous, we can conclude that \lim_{t \to \infty} c^{(t)} = c^*. Further, it is easy to verify that when c^{(t)} = c^{(t-1)} = c^*, Eq. (9) ensures that c^* is a stationary point of Eq. (6); hence it is also a stationary point of Eq. (5). By the convexity of Eq. (5), we conclude that c^* is a global minimum of Eq. (5).

In practice, we use an efficient algorithm to avoid directly inverting the large-scale matrix (D^T D + \lambda S^{(t-1)}). To this end, we exploit the symmetric positive definite nature of S^{(t-1)} and use the Cholesky decomposition S^{(t-1)} = (L^{(t-1)})^T L^{(t-1)}, where L^{(t-1)} is an upper triangular matrix with strictly positive diagonal entries, which is efficiently invertible. Denoting \bar{c}^{(t)} = L^{(t-1)} c^{(t)} and \bar{D}^{(t-1)} = D (L^{(t-1)})^{-1}, the new iterative updating rule is

\bar{c}^{(t)} = \left( (\bar{D}^{(t-1)})^T \bar{D}^{(t-1)} + \lambda I \right)^{-1} (\bar{D}^{(t-1)})^T x.    (15)

Here, in order to invert \left( (\bar{D}^{(t-1)})^T \bar{D}^{(t-1)} + \lambda I \right), we can first apply an eigendecomposition to \bar{D}^{(t-1)} (\bar{D}^{(t-1)})^T, and then invert the matrix by inverting only the diagonal matrix of eigenvalues. Suppose d is the feature dimension of a visual sample x and d ≪ N; the resulting complexity O(d^3) is much smaller than the former O(N^3). Note that the only overhead is the Cholesky decomposition of S^{(t-1)}. Fortunately, due to the extremely sparse form of S^{(t-1)}, this overhead is almost a very small constant.
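A compact way to realize Eq. (15) is to combine the Cholesky factor with the push-through identity (A^T A + λI)^{-1} A^T = A^T (A A^T + λI)^{-1}, so that only a d×d system is ever solved. The sketch below is our own rendering under these assumptions; the variable names are ours.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def efficient_update(D, S, x, lam):
    # Eq. (15): Cholesky S = U^T U, then solve a d x d system instead of M x M.
    d = D.shape[0]
    U = cholesky(S, lower=False)                       # upper triangular factor
    # D_bar = D U^{-1}  <=>  U^T D_bar^T = D^T
    D_bar = solve_triangular(U, D.T, trans='T', lower=False).T
    # (D_bar^T D_bar + lam I)^{-1} D_bar^T x = D_bar^T (D_bar D_bar^T + lam I)^{-1} x
    c_bar = D_bar.T @ np.linalg.solve(D_bar @ D_bar.T + lam * np.eye(d), x)
    return solve_triangular(U, c_bar, lower=False)     # c^(t) = U^{-1} c_bar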

6. EXPERIMENT

In this section, we systematically evaluate the effectiveness of our proposed HSVC approach on two classic multimedia tasks, namely visual annotation and content-based retrieval, comparing it to several baselines and state-of-the-art visual coding methods over various multimedia datasets.

6.1 Experimental Settings

6.1.1 Data

In our experiments, three real-world visual datasets are used for evaluation. The first is ImageNet [6], one of the largest well-organized, publicly available image datasets. All images in ImageNet were collected from the Web and organized in accordance with the WordNet hierarchy; each concept in the hierarchy is depicted by hundreds to thousands of images. We used a subset of ImageNet with 1,000 concepts and 1.27 million images, which was used for ILSVRC 2012¹. This subset contains a partial hierarchy of WordNet and some new nodes that are not in WordNet. In our evaluation, ImageNet is used to construct the semantic-enriched dictionary as described in Section 3. For each individual concept, its attached images are employed for learning a concept-specific dictionary according to Eq. (3). The number of base features in each concept-specific dictionary is set to 10, so the aggregated dictionary contains 10,000 base features in total.

To evaluate the effectiveness of the proposed HSVC approach, we performed two classic multimedia applications, namely visual annotation and content-based image/video retrieval, on the Flickr²-based image set NUS-WIDE and the YouTube³-based Columbia Consumer Video (CCV) database [14]. The former contains 269,648 images annotated with 81 concepts; the latter comprises 9,317 YouTube video clips associated with 20 categories.

¹http://www.image-net.org/challenges/LSVRC/2012/index
²http://www.flickr.com/
³http://www.youtube.com/


Table 1: Overall performance (mean Average Precision) of Classeme, PiCodes, DeCAF, Lasso, Group Lasso and HSVC for visual annotation on the NUS-WIDE and CCV datasets.

Dataset  | Classeme | PiCodes | DeCAF  | Lasso  | Group Lasso | HSVC
NUS-WIDE | 24.09%   | 20.29%  | 36.04% | 37.42% | 38.01%      | 40.15%
CCV      | 47.60%   | 50.08%  | 61.02% | 64.30% | 65.62%      | 67.32%

For the annotation task, we follow the original "train/test" splits, i.e., 161,789/107,859 for NUS-WIDE and 4,659/4,658 for CCV. For the retrieval task, we randomly selected 10% of the test samples from each concept as query examples to search among the training samples.

For all three visual datasets, we adopted the 4,096-D DeCAF generic visual feature [8], i.e., the activations of the 6th layer of a deep CNN trained in a fully supervised fashion on ImageNet. This feature has been shown to be very effective for multimedia tasks on various benchmark datasets. For each individual video sample in the CCV dataset, we averaged the visual features of all its keyframes to form a video-level representation. All features are ℓ2-normalized.

For evaluation metrics, we chose the commonly used mAP (mean Average Precision) over all concepts for visual annotation. For visual retrieval, we adopt Average Precision at the top K retrieved results (AP@K), defined as

AP@K = \frac{1}{\min(R, K)} \sum_{j=1}^{K} \frac{R_j}{j} \times I_j,

where R is the number of relevant visual samples within the dataset, R_j is the number of relevant visual samples among the top j search results, and I_j is 1 if the j-th retrieved sample is relevant and 0 otherwise. We averaged AP@K over all queries to obtain mAP@K as the overall evaluation metric, where K ∈ {1, 2, ..., 100}.
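A direct translation of this metric into code (a minimal sketch of our own) is:

import numpy as np

def ap_at_k(relevant_flags, R, K):
    # relevant_flags[j] is 1 if the (j+1)-th retrieved sample is relevant;
    # R is the number of relevant samples in the whole dataset.
    flags = np.asarray(relevant_flags[:K], dtype=float)
    hits = np.cumsum(flags)                 # R_j for j = 1..K
    ranks = np.arange(1, len(flags) + 1)    # j
    return (hits / ranks * flags).sum() / min(R, K)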

6.1.2 Comparison Methods

In order to illustrate the representative power of our proposed HSVC approach, we compared it to several state-of-the-art semantic coding methods, including:

• Classeme [26], which uses the outputs of 2,659 image classifiers trained on images retrieved from the Bing Image Search engine⁴ according to 2,659 query concepts from LSCOM [20];

• PiCodes [2], which gathers 2,048 binary codes by optimizing the classification processes of 2,659 classifiers on ImageNet;

• DeCAF [8], which utilizes the output of the 6th layer of a deep CNN architecture learned from ImageNet.

To demonstrate the effects of the different components (i.e., semantic sparsity and semantic correlation) of HSVC, we further compared against two baselines, Lasso and Group Lasso. Both coding baselines use the same semantic-enriched dictionary hierarchy as HSVC. Lasso considers neither the semantic sparsity constraint nor the semantic correlation among concepts. Group Lasso imposes the same group constraint as HSVC but does not take semantic correlation into consideration.

For the visual annotation task, we trained concept classifiers using SVMs on the codes generated by the different comparison methods; in our implementation, the liblinear⁵ toolbox was used. For the visual retrieval experiments, all query and gallery samples were represented using the comparison visual codes, and retrieval was performed by similarity search. In particular, we used the ℓ1-norm distance for Classeme, DeCAF and our proposed HSVC, and the Hamming distance for PiCodes, since its outputs are binary codes.

For all comparison methods, we tuned the parameters, such as λ in HSVC, Lasso, and Group Lasso and C in the SVM, in the range {10⁻⁶, 10⁻⁴, ..., 10⁶}.

⁴https://www.bing.com/images/

6.2 Experimental Results I: Visual Annotation

In this part, we report and analyze the experiments on the visual annotation task. We compare HSVC to the other state-of-the-art approaches and show the effects of using the semantic properties. Table 1 shows the overall performance (mAP) of all comparison approaches over all concepts in the NUS-WIDE and CCV datasets. Figure 4 and Figure 5 report the detailed performance (AP) of individual concepts on NUS-WIDE and CCV, respectively.

From the results, we observe that HSVC consistently achieves the best performance among all comparison coding algorithms. Specifically, on NUS-WIDE, HSVC outperforms the Classeme, PiCodes and DeCAF semantic features by around 16%, 20% and 4%, respectively; on CCV, the improvements are about 20%, 17% and 6%, respectively. The underlying reasons are twofold. First, the dictionary of HSVC is derived from an existing knowledge base containing rich semantic and hierarchical information, thereby establishing a favourable foundation for effectively representing visual semantics. Second, our proposed HSVC fully explores the semantic properties of visual codes, including semantic sparsity and semantic correlation. Such considerations are rarely reflected in the above semantic features. On the one hand, semantic sparsity is significant for visual coding since it reveals a genuine pattern of the visual world: within a visual object, only a small part of the entire semantic space is touched. On the other hand, semantic correlation captures intrinsic relationships (e.g., co-occurrence) among concepts, which exerts mutually reinforcing effects on the quality of the resultant visual codes.

The comparison among Lasso, Group Lasso and HSVC further sheds light on the effectiveness of semantic sparsity and semantic correlation. As seen, HSVC and Group Lasso always perform better than Lasso. Although Lasso applies the ℓ1-norm to induce feature-level sparsity, it can

⁵http://www.csie.ntu.edu.tw/~cjlin/liblinear/



Figure 4: Detailed performance (Average Precision) of individual concepts of Classeme, PiCodes, DeCAF and HSVC on visual annotation over the NUS-WIDE dataset.


Figure 5: Detailed performance (Average Precision) of individual concepts of Classeme, PiCodes, DeCAF and HSVC on visual annotation over the CCV dataset.

easily propagate non-zero weights to irrelevant concepts, which harms the semantic purity of the visual codes. Moreover, HSVC consistently obtains better visual annotation results than Group Lasso. A reasonable explanation is that HSVC takes semantic correlation into consideration and thus benefits from the mutual reinforcement among correlated concepts, improving code quality.


Figure 6: Overall performance (mAP@K) of visual retrievalusing 10% random queries on (a) NUS-WIDE and (b) CCVdatasets. K = 1, 2, . . . , 100.

6.3 Experimental Results II: Visual Retrieval

In this part, we report and analyze the experimental results on the visual retrieval task over the NUS-WIDE and CCV datasets. Figures 6(a) and 6(b) give the mAP@K (K = 1, 2, ..., 100) curves of all comparison approaches over all concepts in the NUS-WIDE and CCV datasets, respectively. Figure 7 and Figure 8 report the detailed performance (AP@20) of individual concepts on NUS-WIDE and CCV, respectively.


Figure 8: Detailed performance (mAP@20) of individual concepts of Classeme, PiCodes, DeCAF and HSVC for visual retrieval on the CCV dataset.

We can observe from the results that HSVC achieves clear improvements in retrieval performance over all the other comparison methods on both evaluated datasets. For instance, on NUS-WIDE, HSVC improves upon Classeme, PiCodes and DeCAF by around 18%, 17% and 5% in terms of mAP@20; on CCV, our method performs better than Classeme, PiCodes and DeCAF in terms of mAP@20 by approximately 27%, 28% and 3%, respectively. These performance boosts can reasonably be attributed to the deep exploration of semantic aspects in our proposed approach. While the construction of the semantic-enriched dictionary in the hierarchical fashion guarantees a reliable basis for generating effective visual codes, semantic sparsity and semantic correlation provide a better understanding of the essence of the visual world. By naturally connecting the visual and semantic worlds through the proposed semantic coding method, HSVC is able to retrieve more useful results for users.

6.4 Parameter Sensitivity

In this part, we empirically test the parameter sensitivity of our approach. Specifically, we evaluate the effects of λ on visual annotation and visual retrieval over the two evaluated datasets, tuning λ in the range {10⁻⁶, 10⁻⁴, ..., 10⁴, 10⁶}. The experimental results are illustrated in Figures 9(a) and 9(b) for NUS-WIDE and CCV, respectively. Generally speaking, our approach is not obviously sensitive to different values of λ; nonetheless, we may still observe certain patterns in the results. For instance, as shown in Figure 9(a), when λ gradually grows from 10⁻⁶ to 10², we see an increasing trend in the annotation performance on the NUS-WIDE dataset; as λ keeps increasing to 10⁶, the performance starts to drop.



Figure 7: Detailed performance (mAP@20) of individual concepts of Classeme, PiCodes, DeCAF and HSVC for visual retrieval on the NUS-WIDE dataset.


Figure 9: Evaluation of effects of λ on (a) visual annotation(mAP) and (b) visual retrieval (mAP@20) over NUS-WIDEand CCV datasets.

Such a phenomenon suggests that neither a large nor a small value of λ yields the best performance. When λ is too large, the effect of the hierarchical semantic information in the dictionary may be suppressed; when λ is too small, semantic sparsity and semantic correlation are barely reflected in HSVC.

6.5 Convergence Analysis

As shown in Section 5, the rigorous mathematical analysis guarantees that Algorithm 1 converges to a globally optimal solution as t → ∞. In this part, we conduct an empirical study of the convergence on both evaluated datasets. Figure 10 illustrates the results on the two datasets; here, we fix the trade-off parameter λ to 1. As we can see, in experiments our algorithm converges within only a few iterations, which clearly shows its efficiency for practical use. We surmise that our iterative updating rule in Eq. (9) can almost always find a fast "path" from the initial point to the global optimum.

7. CONCLUSION

In this paper, we introduced a novel visual coding approach, termed hierarchical semantic visual coding (HSVC), to effectively encode visual objects in a semantic hierarchy. Following the designed pipeline, we generated a semantic-enriched dictionary hierarchy, which contains dictionaries corresponding to all concepts in a semantic hierarchy as well as their hierarchical semantic information. We also developed an on-line semantic coding model, which jointly exploits the rich hierarchical semantic information in the learned dictionary and considers various semantic properties of visual objects, including semantic sparsity and semantic correlations among concepts in the semantic hierarchy.


Figure 10: Convergence analysis on the two evaluateddatasets: (a) NUS-WIDE and (b) CCV. For illustration,we fix λ = 1.

We proposed to combine a concept-level sparsity constraint and semantic similarity into a unified regularization term. An effective algorithm was devised for solving the optimization problem in the proposed model, and a rigorous mathematical analysis guarantees that the algorithm converges to the global optimum. Extensive experiments have been conducted to show the effectiveness of our approach.

Although HSVC has achieved superior performance compared to several baselines and state-of-the-art methods, the following open problem remains. The representative power of the learned dictionary relies heavily on an existing hand-crafted semantic hierarchy, which may not be suitable for the constantly evolving visual world. In the future, we will investigate an automatic way of discovering representative concepts and building a semantic dictionary from both visual and textual data, so that visual semantics can be identified more precisely.

Figure 11: Illustration of retrieval examples using our proposed HSVC on NUS-WIDE and CCV. The top two rows are results from NUS-WIDE, while the bottom two are from CCV.

8. REFERENCES

[1] S. Bengio, F. Pereira, Y. Singer, and D. Strelow. Group sparse coding. In NIPS, pages 82–89, 2009.
[2] A. Bergamo, L. Torresani, and A. W. Fitzgibbon. PiCodes: Learning a compact code for novel-category recognition. In NIPS, pages 2088–2096, 2011.
[3] A. Binder, K.-R. Müller, and M. Kawanabe. On taxonomies for multi-class image categorization. IJCV, 99(3):281–301, 2012.
[4] J. Deng, A. C. Berg, and L. Fei-Fei. Hierarchical semantic indexing for large scale image retrieval. In CVPR, pages 785–792, 2011.

[5] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, pages 71–84, 2010.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[7] T. Deselaers and V. Ferrari. Visual and semantic similarity in ImageNet. In CVPR, pages 1777–1784, 2011.
[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[9] C. Elkan. Using the triangle inequality to accelerate k-means. In ICML, pages 147–153, 2003.
[10] C. Fellbaum. WordNet. Theory and Applications of Ontology: Computer Applications, 2010.
[11] S. Gao, L.-T. Chia, and I. W. Tsang. Multi-layer group sparse coding for concurrent image classification and annotation. In CVPR, pages 2809–2816, 2011.
[12] G. Griffin and P. Perona. Learning and using taxonomies for fast visual categorization. In CVPR, pages 1–8, 2008.
[13] J. Huang, H. Liu, J. Shen, and S. Yan. Towards efficient sparse coding for scalable image annotation. In ACM Multimedia, pages 947–956, 2013.
[14] Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. C. Loui. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ICMR, page 29, 2011.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[16] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.
[17] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS, pages 801–808, 2006.
[18] M. Marszałek and C. Schmid. Semantic hierarchies for visual object recognition. In CVPR, pages 1–7, 2007.
[19] L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. JRSS-B, 70(1):53–71, 2008.
[20] M. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis. Large-scale concept ontology for multimedia. IEEE Multimedia, 13(3):86–91, 2006.
[21] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps where, and why? Semantic relatedness for knowledge transfer. In CVPR, pages 910–917, 2010.
[22] L. Shao, R. Yan, X. Li, and Y. Liu. From heuristic optimization to dictionary learning: A review and comprehensive comparison of image denoising algorithms. TCYB, 44(7):1001–1013, 2014.
[23] Y. Song, M. Zhao, J. Yagnik, and X. Wu. Taxonomic classification for web-based videos. In CVPR, pages 871–878, 2010.
[24] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR, pages 3476–3483, 2013.
[25] J. Tang, L. Shao, and X. Li. Efficient dictionary learning for visual categorization. CVIU, 124:91–98, 2014.
[26] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In ECCV, pages 776–789, 2010.
[27] N. Verma, D. Mahajan, S. Sellamanickam, and V. Nair. Learning hierarchical similarity metrics. In CVPR, pages 2280–2287, 2012.
[28] Y. Yang, Y. Yang, Z. Huang, H. T. Shen, and F. Nie. Tag localization with spatial correlations and joint group sparsity. In CVPR, pages 881–888, 2011.
[29] Y. Yang, Y. Yang, and H. T. Shen. Effective transfer tagging from image to video. TOMM, 9(2):14, 2013.
[30] Y. Yang, Z.-J. Zha, Y. Gao, X. Zhu, and T.-S. Chua. Exploiting web images for semantic video indexing via robust sample-specific loss. TMM, 16(6):1677–1689, 2014.
[31] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. JRSS-B, 68(1):49–67, 2006.
[32] H. Zhang, Z.-J. Zha, Y. Yang, S. Yan, Y. Gao, and T.-S. Chua. Attribute-augmented semantic hierarchy: towards bridging semantic gap and intention gap in image retrieval. In ACM Multimedia, pages 33–42, 2013.
[33] S. Zhang, J. Huang, H. Li, and D. N. Metaxas. Automatic image annotation and retrieval using group sparsity. TCYB, 42(3):838–849, 2012.
[34] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, and D. Cai. Graph regularized sparse coding for image representation. TIP, 20(5):1327–1336, 2011.
