Visual Coding in a Semantic Hierarchy
Yang Yang†, Hanwang Zhang‡, Mingxing Zhang†, Fumin Shen†, Xuelong Li§
†University of Electronic Science and Technology of China
‡National University of Singapore
§Chinese Academy of Sciences
{dlyyang, hanwangzhang, zhangmingxing, fumin.shen}@gmail.com; [email protected]
ABSTRACT
In recent years, tremendous research endeavours have been dedicated to seeking effective visual representations for facilitating various multimedia applications, such as visual annotation and retrieval. Nonetheless, existing approaches can hardly achieve satisfactory performance because they seldom fully explore the semantic properties of visual codes. In this paper, we present a novel visual coding approach, termed hierarchical semantic visual coding (HSVC), to effectively encode visual objects (e.g., images and videos) in a semantic hierarchy. Specifically, we first construct a semantic-enriched dictionary hierarchy, which comprises dictionaries corresponding to all concepts in a semantic hierarchy as well as their hierarchical semantic correlation. Moreover, we devise an on-line semantic coding model, which simultaneously 1) exploits the rich hierarchical semantic prior knowledge in the learned dictionary, 2) reflects the semantic sparsity of visual codes, and 3) explores semantic relationships among concepts in the semantic hierarchy. To this end, we propose to integrate a concept-level group sparsity constraint and a semantic correlation matrix into a unified regularization term. We design an effective algorithm to optimize the proposed model, and provide a rigorous mathematical analysis guaranteeing that the algorithm converges to a global optimum. Extensive experiments on various multimedia datasets illustrate the superiority of our proposed approach compared to state-of-the-art methods.
Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Abstracting methods

General Terms
Algorithm
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-
tion on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from [email protected].
MM’15, October 26–30, 2015, Brisbane, Australia.
© 2015 ACM. ISBN 978-1-4503-3459-4/15/10 ...$15.00.
DOI: http://dx.doi.org/10.1145/2733373.2806244.
Keywords
semantic hierarchy; dictionary hierarchy; visual coding; group lasso
1. INTRODUCTION
Recent years have witnessed the power of multimedia techniques to facilitate various real-world applications, such as visual retrieval [32] and visual annotation [29, 30]. Over the past two decades, visual representation (or features) has undergone a transition from simple raw feature extraction, to higher-level feature statistics such as Bag-of-Words, and today, to more complex but generic feature extraction pipelines. For example, one of the most popular pipelines extracts hand-crafted descriptors (e.g., SIFT or HOG) from local visual patches, and then encodes these descriptors into a certain over-complete representation using algorithms such as k-means or sparse coding; after the coding stage, global visual representations are formed by spatially pooling the coded local descriptors. Finally, the global representations of visual samples are fed into task-dependent classifiers. Another promising pipeline has been growing in popularity with the recent advances in deep learning. In general, it is a deep neural network with repeated feature detection and pooling layers. The output of the top layer is regarded as the ultimate visual feature, which is then fed into the final classification layer. Although these two pipelines are motivated differently, both achieve state-of-the-art performance on most image classification benchmarks.
However, compared to humans, who can easily recognize tens of thousands of objects, the above artificial pipelines are obviously inferior. Deng et al. [5] have shown that when classifying ten thousand object categories, the accuracy is only around 3.7%. Admittedly, it is still unclear how to computationally model the human vision system; nevertheless, we can observe the following three crucial defects in current visual representation designs.
• First, one of the core stages of these pipelines is the generation of elementary feature representations (e.g., codebooks, dictionaries or convolutional kernels), which are learned or sampled globally over the whole dataset. However, this may lose fine-grained details that are specific to certain object categories because of sample bias. For example, if the number of crucial elementary features of “dog” is smaller than that of other objects, the resultant codebook is less capable of representing “dog” images.
Figure 1: The proposed hierarchical semantic visual coding (HSVC) pipeline. In the off-line part, we learn dictionaries for every concept in a semantic hierarchy. Each dictionary is learned from the training samples belonging to the referenced concept. In the on-line part, we aggregate the set of learned dictionaries into a holistic semantic-enriched dictionary hierarchy, which provides the fundamental feature basis for HSVC. Meanwhile, we also explore semantic properties of visual codes, including semantic sparsity and semantic correlation. The resultant sparse visual codes are the visual representation used in subsequent tasks such as visual annotation and visual retrieval.
• Second, these pipelines neglect the relationships between different object categories, which is one of the most appealing abilities the human vision system possesses. For example, when a human tries to recognize a “puma”, she may transfer some visual knowledge from “pet cat”. In this way, our vision system can generalize well to objects unseen in the training samples.
• Moreover, real-world visual objects usually contain multiple objects and scenes, which are correlated through certain semantics. For example, “people” and “dog” are more likely to appear in the same scene than “dog” and “fish”, since the former two have a closer semantic relationship than the latter. However, to the best of our knowledge, few existing visual representation methods take this observation into account.
In this paper, we propose a novel semantic visual representation extraction pipeline, termed hierarchical semantic visual coding (HSVC), which encodes visual objects in accordance with a semantic hierarchy. As illustrated in Figure 1, the proposed pipeline comprises an off-line part and an on-line part. Suppose we have a semantic hierarchy populated with training samples referenced to each concept (e.g., ImageNet [6]). In the off-line part, we embed rich hierarchical semantic information into the dictionary learning process to construct a semantic-enriched dictionary hierarchy. For each individual concept in the semantic hierarchy, we extract visual features (e.g., DeCAF [8], Bag-of-Words, etc.) for all the attached training samples and learn a concept-specific dictionary; the dictionaries corresponding to all concepts in the hierarchy are then concatenated to form a holistic dictionary. Moreover, in order to facilitate the subsequent on-line visual coding, we also precompute semantic similarities (or correlations) of concepts in the hierarchy. In the on-line part, we devise a novel semantic coding approach that fully explores the semantic properties of visual codes, including the rich hierarchical semantic information in the learned dictionary, the semantic sparsity of concepts, and the semantic relationships among concepts in the semantic hierarchy. To achieve this goal, we incorporate a novel semantically structured regularization term, which not only induces concept-level group sparsity to preserve the semantic purity of visual samples, but also employs hierarchical semantic cues to retain semantic correlations among concepts and enhance the representative power of visual codes. The generated sparse codes are considered the final visual representation and are further used in specific tasks. We conduct extensive experiments on real-world large-scale datasets, and the promising results demonstrate that the proposed visual codes are more effective in supporting fundamental applications, such as visual annotation and retrieval, than state-of-the-art methods.
We highlight the three major contributions of the proposed coding pipeline as follows:
• Dictionary Learning in a Semantic Hierarchy. In order to prepare dictionaries for generating effective visual representations, we propose a novel dictionary learning approach in a semantic hierarchy. We learn a concept-specific dictionary for every concept using all its attached visual samples. The constructed dictionary comprises the elementary features that are most representative of the referenced concept. By aggregating the dictionaries of all concepts in the semantic hierarchy, we obtain a holistic semantic-enriched dictionary hierarchy.
• Concept-specific Group Sparsity. We observe that real-world visual objects normally contain very sparse semantic concepts. For instance, an image of “car” may only correspond to a small number of concepts such as “car”, “road”, and “building”. Hence, encoding information from all the concepts into an individual visual sample may affect the semantic purity of visual samples and lead to imprecise representations. In order to avoid this problem, we propose to impose concept-level sparsity on the desired codes. Each group is the part of the dictionary (i.e., a group of base elements) that corresponds to a specific semantic concept, and the induced sparsity leads to sparse codes involving a reasonable number of concepts.
• Hierarchical Semantic Correlation. In order to make the sparse codes more meaningful and representative, we further exploit the semantic relationships among concepts when coding visual objects. By exploiting the explicit semantic information of concepts in the hierarchy, we are able to calculate semantic similarity (or correlation) and incorporate this information into the coding process. For instance, we may penalize the co-occurrence of the codes of “people” and “dog” more lightly than that of “fish” and “dog”. Therefore, the resultant visual codes are more semantically informative.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes dictionary learning in a semantic hierarchy. Section 4 details the proposed hierarchical semantic visual coding pipeline, including the formulation, algorithms and a rigorous mathematical proof. Experimental results and analysis are reported in Section 5, followed by conclusions in Section 6.
2. RELATED WORK
2.1 Visual Coding for Representation
State-of-the-art visual representation comprises the following stages. First, raw visual feature descriptors (e.g., color, texture or shape) are extracted from local patches; then, these features are encoded into an over-complete representation. One general coding approach is formulated as follows:
\min_c \|x - Dc\|_2^2, \quad \text{s.t. } c \in \mathcal{F},   (1)
where x ∈ R^d is the feature vector of a visual sample, d is the dimensionality of the feature space, D ∈ R^{d×M} is an over-complete dictionary (i.e., M ≥ d), M is the number of bases in the dictionary, c ∈ R^M is the sparse code, and F is a certain feasible region. For example, when F = {c | c ∈ {0, 1}^M, ‖c‖_1 = 1}, the coding reduces to the well-known k-means method [9]; when F = {c | ‖c‖_1 ≤ γ}, where γ > 0 is a constant, it becomes the popular sparse coding [17]. After the coding stage, global image representations are formed by spatially pooling the coded local descriptors [16]. Finally, such global representations are used in specific tasks such as classification and retrieval. Methods following such a pipeline have achieved state-of-the-art performance on several benchmark datasets. Recently, the convolutional neural network (CNN) has been shown to be a successful alternative for learning image representations, achieving a significant boost in image classification [8, 15, 24]. These two pipelines both follow a hierarchical feature extraction philosophy, but the core difference is that a CNN automatically learns features while the other pipeline is manually designed. However, they do not conflict, since the learned image representation can be used as input to the manually designed pipeline to further boost performance [8].
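As a concrete illustration of the feasible-region view of Eq. (1), the k-means case (one-hot codes) reduces to nearest-base assignment. The following minimal sketch assumes x is a d-dimensional feature and D stores one base per column; the function name is ours, not from the paper:

```python
import numpy as np

def kmeans_code(x, D):
    """Eq. (1) with F = one-hot vectors: the best one-hot c simply
    selects the base (column of D) nearest to x."""
    errs = np.linalg.norm(D - x[:, None], axis=0)  # ||x - d_m|| per base
    c = np.zeros(D.shape[1])
    c[np.argmin(errs)] = 1.0
    return c
```

With the ℓ1-ball constraint instead, the same objective yields the familiar sparse coding problem, solved by dedicated convex solvers.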
Compared to local feature-level coding methods for image representation, our work falls into image-level coding, where the feature x is a global representation of an image (e.g., we use the DeCAF feature in this paper). In this way, we relate to some recent work. Zheng et al. [34] exploited the geometric structure of the data space, casting the image-level sparse coding problem into a graph embedding one. Huang et al. [13] proposed an efficient sparse coding method for image annotation, where the dictionary is the feature matrix of all the training images. Some studies used coding methods with group sparsity, which tackles the problem that ℓ1-norm sparse coding implicitly assumes that each element in the dictionary is independent of all others. Essentially, group sparsity integrates correlations among the dictionary elements. Zhang et al. [33] used group sparsity to select useful features of the same type for image annotation. Bengio et al. [1] proposed group sparse coding for jointly encoding the group of visual descriptors within the same image to achieve image-level sparsity. Yang et al. [28] considered regions of an image as a group, which is used to achieve region-level image annotation. Gao et al. [11] extended the meaning of group to the “class”, “tag” and “instance” levels and proposed a multi-layer group sparse coding algorithm for image annotation. Formally, given a dictionary D = [D_1, D_2, ..., D_G], where D_i is a group (or subset) of the entire dictionary D, sparse coding with group sparsity (a.k.a. Group Lasso [31]) can be formulated as
\min_c \Big\| x - \sum_{i=1}^{G} D_i c_i \Big\|_2^2 + \lambda \sum_{i=1}^{G} \|c_i\|_2,   (2)
where λ is the trade-off constant and c_i is the coding coefficient corresponding to the i-th group. In fact, the group sparsity constraint \sum_{i=1}^{G} \|c_i\|_2 can be considered the ℓ1-norm of the vector [‖c_1‖_2, ‖c_2‖_2, ..., ‖c_G‖_2]^T. In this paper, we consider a group as a concept-specific dictionary.
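The Group Lasso problem of Eq. (2) can be solved, for instance, by proximal gradient descent with block soft-thresholding. The following is a minimal numpy sketch; the solver choice and all names are our assumptions, since the paper does not prescribe an optimizer for Eq. (2):

```python
import numpy as np

def group_lasso_code(x, D_groups, lam=0.1, n_iter=500):
    """ISTA sketch for Eq. (2): gradient step on the fit term,
    then block soft-thresholding, one block per group dictionary D_i."""
    D = np.hstack(D_groups)                     # concatenated dictionary
    sizes = [Di.shape[1] for Di in D_groups]
    b = np.cumsum([0] + sizes)                  # group index boundaries
    step = 1.0 / np.linalg.norm(D, 2) ** 2      # 1 / Lipschitz constant
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = c - step * (D.T @ (D @ c - x))      # gradient step
        for i in range(len(sizes)):
            ci = g[b[i]:b[i + 1]]
            norm = np.linalg.norm(ci)           # block soft-threshold:
            scale = max(0.0, 1 - step * lam / norm) if norm > 0 else 0.0
            c[b[i]:b[i + 1]] = scale * ci       # whole blocks can vanish
    return c
```

With λ large enough, entire concept blocks shrink exactly to zero, which is the concept-level sparsity exploited later in this paper.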
2.2 Semantic Hierarchy
A semantic hierarchy is a formally defined taxonomy or ontology, where each node represents a semantic concept, such as WordNet [10], ImageNet [6], or LSCOM [20]. It organizes semantic concepts from general to specific and has been shown to be effective in boosting visual recognition [18] and retrieval [4, 7].
In this paper, we predict the semantic path of images with hierarchical concept classifiers. Compared to a bag of flat classifiers (e.g., a bag of one-vs-all classifiers), hierarchical classifiers exploit the ontological relationship and thus significantly reduce prediction time while producing comparatively good accuracy. For example, [12] automatically built classification trees which achieve significant speedup at a small performance drop. Other studies found that hierarchical top-down prediction using local classifiers can retain the same performance as one-vs-all classifiers [18]. Here, “local” denotes a concept classifier training strategy that considers the images of the concept itself as positive and those of its siblings as negative. However, local classifiers may suffer severely from the error propagation problem. If a classifier at a higher level fails, it is hard for the successive classifiers at lower levels to stop the error (e.g., classify the error as negative), since they have never seen such an error as a negative sample. To alleviate this problem, [3] developed a structure learning method that optimizes the performance of all local classifiers in a semantic hierarchy. However, the training cost of that method is prohibitive for the large-scale semantic hierarchy employed in this work. For efficiency, we adopt a hierarchical one-vs-all training strategy to learn local concept classifiers, as in [23].
Most recent work exploits the semantic hierarchy for image retrieval and designs a semantic similarity function that embeds hierarchical information. In [7], given two images, their visual nearest-neighbor images were first found, and their semantic distance was computed as a distance between the concepts of their neighbors. [4] developed a hierarchical bilinear similarity function for image retrieval. They first represented an image as a semantic vector z consisting of its relevances to a set of concepts, and defined the bilinear similarity between any two images as z_i^T S z_j, where S is a matrix encoding the pairwise semantic affinities among the concepts. They showed that this method achieves state-of-the-art image retrieval performance on ImageNet. Recently, in [27], a new method was proposed to associate a separate visual similarity metric with every concept in a hierarchy and then learn the metrics jointly through an aggregated hierarchical metric. In our work, we use the semantic similarities between the concepts in a semantic hierarchy to regularize the semantic coding. By doing this, we not only achieve semantic group sparsity, which boosts the discriminative power of the sparse codes, but also enrich their semantic descriptiveness.
3. DICTIONARY LEARNING IN A SEMANTIC HIERARCHY
Dictionary learning [22, 25] plays a vital role in multimedia applications. The premise of our visual coding pipeline is to construct a semantic-enriched dictionary hierarchy in accordance with a given semantic hierarchy. As shown in Figure 1, for each concept in the hierarchy, we learn a dictionary from the visual samples that belong to the concept. Without loss of generality, for any concept c, we learn a dictionary D_c from the training samples as follows:
\min_{C, D_c} \|X - D_c C\|_F^2,   (3)
where X = [x_1, ..., x_{N_c}] ∈ R^{d×N_c} is the matrix of the training samples, N_c is the number of training samples, x_i ∈ R^d is the i-th sample, and C is the matrix of the corresponding codes for all training samples. ‖·‖_F is the Frobenius norm of a matrix. Fortunately, the solution we are interested in, the dictionary D_c, can be obtained through the Singular Value Decomposition (SVD) of the visual sample matrix, X = D_c Σ C^T, where Σ is the diagonal matrix of singular values, and D_c and C are the matrices of the left and right singular vectors, respectively. Therefore, the ultimate dictionary with respect to the entire semantic hierarchy can be obtained by concatenating all the concept-specific dictionaries:
D = [D1, ...,Dc, ...,DG] , (4)
where G is the number of semantic concepts in the hierarchy.
Dictionary D_c can be considered a base matrix that spans an intrinsic subspace for the visual information of concept c. The advantages of our concept-specific dictionary learning are two-fold. First, since we eventually combine these dictionaries, the size of each dictionary should be small; otherwise it will cause prohibitive computation problems. Fortunately, the SVD solution for D_c leads to a compact base matrix depicting the characteristics of a specific concept. In this work, we set the number of base features to 10, which means only the top 10 directions of visual variance are captured. Note that each dictionary is only responsible for its own concept; therefore we merely need to retain the principal variances in each dictionary, since the “entire” picture eventually emerges by aggregating all the “local” pictures.
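Under the SVD view above, learning a concept dictionary amounts to keeping the top left singular vectors of the sample matrix. A minimal numpy sketch (function and variable names are illustrative):

```python
import numpy as np

def learn_concept_dictionary(X, n_bases=10):
    """Sketch of Eq. (3): the concept dictionary D_c is the matrix of
    the top left singular vectors of the d x N_c sample matrix X.
    The thin SVD keeps only the leading directions of visual variance."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_bases]          # d x n_bases orthonormal base matrix

# Aggregating per-concept dictionaries into the holistic one of Eq. (4):
# D = np.hstack([learn_concept_dictionary(Xc) for Xc in concept_samples])
```

Because the columns of U are orthonormal, each concept dictionary is a compact, well-conditioned basis for its concept subspace.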
Figure 3: Illustration of semantic sparsity and semantic correlation among concepts. As shown, due to its limited amount of content, an individual image can only cover a small proportion of the semantic concepts in the whole semantic space, which implies that within the image, the semantic information should be sparse at the concept level. On the other hand, concepts co-occurring in the same image (e.g., “sky” and “building”) should be semantically correlated, and thus should mutually reinforce the weights of each other’s codes.
Figure 2 illustrates the semantic meanings of the dictionaries we learned. Second, computing D_c is very efficient, especially when the number of training samples is large. This is very common when learning dictionaries for more general concepts, which possess many more samples. In fact, no matter how large the number of samples (i.e., the column dimension of X) is, we can always efficiently obtain D_c by computing the singular vectors of X^T, which has a smaller column dimension, i.e., the dimension of the original visual feature.
4. CODING VISUAL OBJECTS IN A SEMANTIC HIERARCHY
After we harvest the concept-dependent dictionaries as described above, we are ready to encode visual objects into sparse codes while exploring the semantic properties in the hierarchy. In this way, we expect the sparse codes not only to provide efficient sparse visual representations but also to be explicitly interpretable in terms of semantic meanings such as objects and the relationships among them. As in Eq. (2), a straightforward way is to formulate the coding problem as a Group Lasso problem. However, this formulation treats the codes of each semantic concept independently and thus neglects the hierarchical semantic relationships between them. In fact, concepts are semantically inter-correlated in the semantic space. Furthermore, for an individual visual object, the related semantics should be sparse due to its limited amount of content. In this section, we show how to efficiently and effectively encode visual objects in a semantic hierarchy by jointly considering semantic sparsity and semantic correlation.
4.1 Formulation
If we inspect Eq. (2), we can easily find that the Group Lasso term \sum_{i=1}^{G} \|c_i\|_2 fails to incorporate the relationship between c_i and c_j. For example, if the semantic similarity s_{dog,people} of “dog” and “people” is large, we should consider
Figure 2: Illustration of concept-specific dictionaries. By Eq. (3), the learned dictionary only captures the intrinsic semantics of the concept. For the concepts “cock” (ImageNet ID: n01514668), “chickadee” (ImageNet ID: n01592084), and “bird” (ImageNet ID: n01503061), we choose the dimension of the dictionary with the largest singular value, and we map all the images in each concept onto this dimension. The top and bottom images have the largest and smallest mapped values, respectively, representing semantic meanings. For example, the dimension shown for the “bird” dictionary captures birds with/without wings.
them jointly and expect them to co-occur in the sparse codes with a relatively high probability; if the similarity s_{dog,fish} of the concepts “dog” and “fish” is small, it is reasonable to neglect their joint sparsity since they are semantically far apart. In general, we propose to jointly consider the sparsity of the codes through a similarity weight s_{ij} between concepts c_i and c_j:
\min_c \Big\| x - \sum_{i=1}^{G} D_i c_i \Big\|_2^2 + \lambda \sum_{i=1}^{G} \sum_{j=1}^{G} \big( c_i^T S_{ij} c_j \big)^{1/2},   (5)
where S_{ij} is a diagonal matrix whose diagonal entries are s_{ij}. In particular, the original Group Lasso in Eq. (2) is a special case with s_{ij} = 1 if i = j and s_{ij} = 0 otherwise. Specifically, s_{ij} can be defined as s_{ij} = f(π(i, j)), where π(i, j) is the lowest common ancestor of concepts c_i and c_j, and f(·) is some real-valued function that is non-decreasing going down the hierarchy, i.e., the lower the lowest shared ancestor is, the more similar concepts i and j are. For example, “dog” is much more similar to “people” than to “car”, because “dog” shares the lower-level common ancestor “mammal” with “people”, whereas it shares only “object” with “car”. Figure 3 illustrates semantic sparsity and semantic correlation. Note that there are other ways to obtain s_{ij}, e.g., mining an Internet corpus or exploiting crowdsourcing [21].
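As a toy illustration of the lowest-common-ancestor similarity s_ij = f(π(i, j)), the sketch below uses the depth of the LCA with a linear f; both the linear f and the parent/depth data structures are our assumptions, since the paper leaves f open:

```python
def lowest_common_ancestor(parent, i, j):
    """parent maps each concept to its parent (the root maps to None)."""
    ancestors = set()
    while i is not None:         # collect all ancestors of i (incl. i)
        ancestors.add(i)
        i = parent[i]
    while j not in ancestors:    # climb from j until a shared ancestor
        j = parent[j]
    return j

def semantic_similarity(parent, depth, i, j, f=lambda d: d):
    """s_ij = f(depth of LCA): the deeper the shared ancestor, the more
    similar the concepts. The linear f is a placeholder choice."""
    return f(depth[lowest_common_ancestor(parent, i, j)])
```

On a tiny hierarchy object → {mammal → {dog, people}, car}, "dog" and "people" share the deeper ancestor "mammal" and thus score higher than "dog" and "car", matching the example in the text.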
In [19], it has been shown that the generalized form of Group Lasso in Eq. (5) is an intermediate regularizer between the ℓ1-norm of the Lasso and the ℓ2-norm of ridge regression. This indicates that our visual coding formulation encourages concept-level sparsity while taking hierarchical semantic prior knowledge into consideration.
4.2 Solution
Note that our coding formulation in Eq. (5) is non-smooth because it contains the square root that imposes concept-level sparsity. In this part, we propose an efficient iterative algorithm to solve the optimization problem in Eq. (5).
To illustrate the algorithm: at each iteration, we reformulate Eq. (5) into an alternative quadratic and smooth form by dividing by the Group Lasso term from the previous iteration,
\min_{c^{(t)}} \Big\| x - \sum_{i=1}^{G} D_i c_i^{(t)} \Big\|_2^2 + \lambda \sum_{i=1}^{G} \sum_{j=1}^{G} \frac{c_i^{(t)T} S_{ij} c_j^{(t)}}{2 \big( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \big)^{1/2}}.   (6)
The re-weighting term \big( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \big)^{1/2} can be considered a constant, since we assume that \{c_i\}_{i=1}^{G} has already been obtained at the previous iteration. Eq. (6) can be rewritten in a more compact form as
\min_{c^{(t)}} \big\| x - D c^{(t)} \big\|_2^2 + \lambda \, c^{(t)T} S^{(t-1)} c^{(t)},   (7)
where D = [D_1, ..., D_G], c = [c_1^T, ..., c_G^T]^T, and
S^{(t-1)} = \begin{pmatrix} S_{11}^{(t-1)} & \cdots & S_{1G}^{(t-1)} \\ \vdots & \ddots & \vdots \\ S_{G1}^{(t-1)} & \cdots & S_{GG}^{(t-1)} \end{pmatrix}, \qquad S_{ij}^{(t-1)} = \frac{S_{ij}}{2 \big( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \big)^{1/2}}.   (8)
It is easy to see that Eq. (7) has the closed-form solution:

c^{(t)} = \big( D^T D + \lambda S^{(t-1)} \big)^{-1} D^T x.   (9)
This equation is the resultant iterative updating rule for the ultimate hierarchical semantic sparse codes c*. The detailed procedure is summarized in Algorithm 1.
In the next section, we provide a detailed analysis of the proposed algorithm and a rigorous mathematical verification of its convergence.
Algorithm 1: Hierarchical Semantic Visual Coding
Input : A visual feature x, dictionary matrix D and similarity matrix S w.r.t. a semantic hierarchy, trade-off parameter λ;
Output: Hierarchical semantic sparse codes c*;
1 Initialization: Randomly initialize sparse codes c^(0) with ‖c^(0)‖_2 = 1, t = 0;
2 repeat
3   t ← t + 1;
4   Calculate S^(t-1) according to Eq. (8);
5   c^(t) = (D^T D + λ S^(t-1))^{-1} D^T x;
6 until there is no change to c^(t);
7 Return: c* = c^(t);
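A direct numpy transcription of Algorithm 1 might look as follows. This is a sketch under our assumptions: all groups share a common size, each block S_ij of Eq. (5) is s_ij times an identity (S_ij is diagonal with entries s_ij), and blocks whose coupling term is nonpositive are skipped, a crude numerical guard that the paper does not need since it assumes these terms remain well-defined:

```python
import numpy as np

def hsvc_encode(x, D_groups, S_concept, lam=0.1, tol=1e-6, max_iter=100):
    """Sketch of Algorithm 1. D_groups: list of G concept dictionaries
    (d x m); S_concept: G x G matrix of similarities s_ij."""
    D = np.hstack(D_groups)
    sizes = [Di.shape[1] for Di in D_groups]
    b = np.cumsum([0] + sizes)                   # group boundaries
    G, M = len(sizes), D.shape[1]
    c = np.random.default_rng(0).standard_normal(M)
    c /= np.linalg.norm(c)                       # ||c^(0)||_2 = 1
    DtD, Dtx = D.T @ D, D.T @ x
    for _ in range(max_iter):
        S_big = np.zeros((M, M))                 # assemble Eq. (8)
        for i in range(G):
            for j in range(G):
                val = S_concept[i, j] * (c[b[i]:b[i+1]] @ c[b[j]:b[j+1]])
                if val > 1e-12:                  # skip ill-defined blocks
                    w = S_concept[i, j] / (2.0 * np.sqrt(val))
                    S_big[b[i]:b[i+1], b[j]:b[j+1]] = w * np.eye(sizes[i])
        c_new = np.linalg.solve(DtD + lam * S_big, Dtx)   # Eq. (9)
        done = np.linalg.norm(c_new - c) < tol
        c = c_new
        if done:
            break
    return c
```

With S_concept set to the identity, this reduces to an iteratively re-weighted solver for the plain Group Lasso of Eq. (2), matching the special case noted after Eq. (5).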
5. ALGORITHMIC ANALYSIS
In this section, we prove that Algorithm 1 converges to a global minimum of Eq. (5) and analyze its computational complexity. We first show that Algorithm 1 converges to a limiting point and then show that this point is a stationary point of Eq. (5), which is therefore a global minimum due to the convexity of Eq. (5). We start with the following lemma.
Lemma 1. For nonzero vectors c and b, and a symmetric positive definite matrix S, we have

\big( c^T S c \big)^{1/2} - \frac{c^T S c}{2 \big( b^T S b \big)^{1/2}} \le \big( b^T S b \big)^{1/2} - \frac{b^T S b}{2 \big( b^T S b \big)^{1/2}}.   (10)
Proof. Since S is symmetric positive definite, we have S = L^T L by the Cholesky decomposition, and for any nonzero vector c we have c^T S c > 0. Denote \bar{b} = Lb and \bar{c} = Lc. Since (\|\bar{c}\|_2 - \|\bar{b}\|_2)^2 \ge 0, expanding the square and dividing by 2\|\bar{b}\|_2 gives

\|\bar{c}\|_2 - \frac{\|\bar{c}\|_2^2}{2\|\bar{b}\|_2} \le \|\bar{b}\|_2 - \frac{\|\bar{b}\|_2^2}{2\|\bar{b}\|_2},   (11)

which, together with \|\bar{c}\|_2^2 = c^T S c and \|\bar{b}\|_2^2 = b^T S b, proves Lemma 1.
Theorem 1 (Convergence of Algorithm 1). The iterative update in Algorithm 1 guarantees that the objective function in Eq. (5) converges to a global minimum.
Proof. Since the solution in Eq. (9) minimizes the value of the objective function in Eq. (7), we have

\Big\| x - \sum_{i=1}^{G} D_i c_i^{(t)} \Big\|_2^2 + \lambda \sum_{i,j} \frac{c_i^{(t)T} S_{ij} c_j^{(t)}}{2 \big( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \big)^{1/2}}
\le \Big\| x - \sum_{i=1}^{G} D_i c_i^{(t-1)} \Big\|_2^2 + \lambda \sum_{i,j} \frac{c_i^{(t-1)T} S_{ij} c_j^{(t-1)}}{2 \big( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \big)^{1/2}}.   (12)
By applying Lemma 1, we have

\big( c_i^{(t)T} S_{ij} c_j^{(t)} \big)^{1/2} - \frac{c_i^{(t)T} S_{ij} c_j^{(t)}}{2 \big( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \big)^{1/2}}
\le \big( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \big)^{1/2} - \frac{c_i^{(t-1)T} S_{ij} c_j^{(t-1)}}{2 \big( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \big)^{1/2}}.   (13)
By summing Ineq. (13) over i and j, multiplying by λ, and adding it to Ineq. (12), we have

\Big\| x - \sum_{i=1}^{G} D_i c_i^{(t)} \Big\|_2^2 + \lambda \sum_{i=1}^{G} \sum_{j=1}^{G} \big( c_i^{(t)T} S_{ij} c_j^{(t)} \big)^{1/2}
\le \Big\| x - \sum_{i=1}^{G} D_i c_i^{(t-1)} \Big\|_2^2 + \lambda \sum_{i=1}^{G} \sum_{j=1}^{G} \big( c_i^{(t-1)T} S_{ij} c_j^{(t-1)} \big)^{1/2}.   (14)
Ineq. (14) shows that the updating rule in Eq. (9) decreases the value of the objective function in Eq. (5); denote this as F(c^(t)) ≤ F(c^(t−1)). Since F(·) is bounded below by zero, we can conclude that F(c^(t)) converges to a limiting value F* as t → ∞. Moreover, since F(·) is Lipschitz continuous, we can conclude that lim_{t→∞} c^(t) = c*. Further, it is easy to verify that when c^(t) = c^(t−1) = c*, Eq. (9) ensures that c* is a stationary point of Eq. (6), and hence it is also a stationary point of Eq. (5). Due to the convexity of Eq. (5), we conclude that c* is a global minimum of Eq. (5).
In practice, we use an efficient algorithm to avoid directly inverting the large-scale matrix (D^T D + λS^(t−1)). To this end, we exploit the symmetric positive definite nature of S^(t−1) via the Cholesky decomposition S^(t−1) = (L^(t−1))^T L^(t−1), where L^(t−1) is a triangular matrix with strictly positive diagonal entries and is thus efficiently invertible. By denoting \bar{c}^{(t)} = L^{(t-1)} c^{(t)} and \bar{D}^{(t-1)} = D (L^{(t-1)})^{-1}, the new iterative updating rule is

\bar{c}^{(t)} = \big( (\bar{D}^{(t-1)})^T \bar{D}^{(t-1)} + \lambda I \big)^{-1} (\bar{D}^{(t-1)})^T x.   (15)

Here, in order to invert ((\bar{D}^{(t-1)})^T \bar{D}^{(t-1)} + λI), we can first apply an eigendecomposition to \bar{D}^{(t-1)} (\bar{D}^{(t-1)})^T, and then invert the matrix by inverting only the diagonal matrix of eigenvalues. Suppose d is the feature dimension of a visual sample x and d ≪ N; then the complexity O(d^3) is much smaller than the former O(N^3). Note that the only overhead is the Cholesky decomposition of S^(t−1). Fortunately, due to the extremely sparse form of S^(t−1), this overhead is almost a very small constant.
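Assuming S^(t−1) is symmetric positive definite, the whitening step of Eq. (15) combined with the standard push-through identity (A^T A + λI)^{-1} A^T = A^T (A A^T + λI)^{-1} can be sketched as follows; only a d × d system is solved per iteration:

```python
import numpy as np

def efficient_update(x, D, S, lam):
    """Sketch of the Eq. (15) speed-up: whiten by a Cholesky factor of S,
    then solve in the d-dimensional space via the push-through identity,
    costing O(d^3) rather than O(N^3). Assumes S is positive definite."""
    Lc = np.linalg.cholesky(S)            # numpy form: S = Lc Lc^T
    Dw = np.linalg.solve(Lc, D.T).T       # whitened dictionary D Lc^{-T}
    d = D.shape[0]
    # push-through: (Dw^T Dw + lam I)^{-1} Dw^T x = Dw^T (Dw Dw^T + lam I)^{-1} x
    c_bar = Dw.T @ np.linalg.solve(Dw @ Dw.T + lam * np.eye(d), x)
    return np.linalg.solve(Lc.T, c_bar)   # undo the whitening
```

The result coincides with the direct solution (D^T D + λS)^{-1} D^T x, which makes the routine easy to sanity-check against Eq. (9).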
6. EXPERIMENTS
In this section, we systematically evaluate the effectiveness of our proposed HSVC approach on two classic multimedia tasks, namely visual annotation and content-based retrieval, comparing against several baselines and state-of-the-art visual coding methods on various multimedia datasets.
6.1 Experimental Settings
6.1.1 Data
In our experiments, three real-world visual datasets are utilized for evaluation. The first one is ImageNet [6], one of the largest well-organized, publicly available image datasets. All images in ImageNet were collected from the Web and organized in accordance with the WordNet hierarchy. Each concept in the hierarchy is depicted by hundreds to thousands of images. We used a subset of ImageNet with 1,000 concepts and 1.27 million images, which was used for ILSVRC 2012¹. This dataset contains a partial hierarchy of WordNet plus some new nodes that are not in WordNet. In our evaluation, ImageNet is used for constructing the semantic-enriched dictionary described in Section 3. For each individual concept, its attached images are employed to learn a concept-specific dictionary according to Eq. (3). The number of base features in each concept-specific dictionary is set to 10, so we have 10,000 base features in total in the aggregated dictionary.

To evaluate the effectiveness of the proposed HSVC approach, we perform two classic multimedia applications, namely visual annotation and content-based image/video retrieval, on one Flickr²-based image set, NUS-WIDE, and one YouTube³-based video set, the Columbia Consumer Video (CCV) database [14]. While the former contains 269,648 images annotated with 81 concepts, the latter comprises 9,317 YouTube video clips associated
1 http://www.image-net.org/challenges/LSVRC/2012/index
2 http://www.flickr.com/
3 http://www.youtube.com/
Table 1: Overall performance (mean Average Precision) of Classeme, PiCodes, DeCAF, Lasso, Group Lasso and HSVC for visual annotation on NUS-WIDE and CCV datasets.

Dataset   | Classeme | PiCodes | DeCAF  | Lasso  | Group Lasso | HSVC
NUS-WIDE  | 24.09%   | 20.29%  | 36.04% | 37.42% | 38.01%      | 40.15%
CCV       | 47.60%   | 50.08%  | 61.02% | 64.30% | 65.62%      | 67.32%
to 20 categories. For the annotation task, we follow the original "train/test" splits, i.e., "161,789/107,859" for NUS-WIDE and "4,659/4,658" for CCV. For the retrieval task, we randomly selected 10% of the test samples from each concept as query examples to search against the training samples.
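The per-concept 10% query sampling can be sketched as follows (function and variable names are hypothetical):

```python
import random

def sample_queries(test_ids_by_concept, frac=0.1, seed=0):
    """Randomly pick `frac` of the test samples of each concept as
    retrieval queries, mirroring the 10%-per-concept protocol above."""
    rng = random.Random(seed)
    queries = {}
    for concept, ids in test_ids_by_concept.items():
        k = max(1, round(frac * len(ids)))
        queries[concept] = rng.sample(ids, k)
    return queries

# toy test pool: 50 "beach" samples and 30 "cat" samples
test_ids = {"beach": list(range(50)), "cat": list(range(50, 80))}
q = sample_queries(test_ids)
print(len(q["beach"]), len(q["cat"]))  # 5 3
```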
For all three visual datasets, we adopted the 4,096-D DeCAF generic visual feature [8], which is the activations of the 6-th layer of a deep CNN trained in a fully supervised fashion on ImageNet. It has been shown that this feature is very effective for multimedia tasks on various benchmark datasets. For each individual video sample in the CCV dataset, we averaged the visual features of all its keyframes to form a video-level representation. All features are ℓ2-normalized.

For evaluation metrics, we chose the commonly used mAP (mean Average Precision) over all concepts for visual annotation evaluation. For visual retrieval, we adopt Average Precision at top K retrieved results (AP@K), which is defined as
AP@K = (1 / min(R, K)) · ∑_{j=1}^{K} (R_j / j) · I_j,
where R is the number of relevant visual samples within the dataset and R_j is the number of relevant visual samples among the top j search results. I_j is set to 1 if the j-th visual sample is relevant, and 0 otherwise. We averaged AP@K over all queries to obtain mAP@K as the overall evaluation metric, where K ∈ {1, 2, . . . , 100}.

6.1.2 Comparison Methods

In order to illustrate the representative power of our proposed HSVC approach, we compared it to several state-of-the-art semantic coding methods, including
• Classeme [26], which uses the output of 2,659 imageclassifiers trained on images retrieved from Bing ImageSearch engine4 according to 2,659 query concepts fromLSCOM [20];
• PiCodes [2], which gathers 2,048 binary codes by optimizing the classification performance of 2,659 classifiers on ImageNet;
• DeCAF [8], which utilizes the output of the 6-th layer in a deep CNN architecture learned from ImageNet.
To demonstrate the effects of different components (e.g., semantic sparsity and semantic correlation) of HSVC, we further compared to two baselines, namely Lasso and Group Lasso. For both of these coding baselines, we used the same semantic-enriched dictionary hierarchy as in HSVC. For Lasso, we did not consider any semantic sparsity constraint or semantic correlation among concepts. For Group
4 https://www.bing.com/images/
Lasso, the visual codes follow the same constraint as HSVC, but do not take semantic correlation into consideration.
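The difference between the two baseline regularizers can be sketched as follows (a minimal numpy illustration; the grouping of dictionary atoms by concept follows Section 3):

```python
import numpy as np

def l1_penalty(a):
    """Lasso regularizer: feature-level sparsity over all atoms,
    ignoring which concept each atom belongs to."""
    return np.sum(np.abs(a))

def group_penalty(a, groups):
    """Group Lasso regularizer: concept-level sparsity, i.e., the sum
    of l2 norms of the coefficient block of each concept."""
    return sum(np.linalg.norm(a[list(g)]) for g in groups)

# toy code over 2 concepts with 2 atoms each; only concept 1 is active
a = np.array([0.0, 0.0, 3.0, 4.0])
groups = [range(0, 2), range(2, 4)]
print(l1_penalty(a))             # 7.0
print(group_penalty(a, groups))  # 5.0, i.e., 0 + sqrt(3^2 + 4^2)
```

The group penalty drives whole concept blocks to zero at once, which is the "semantic sparsity" the comparison above isolates; neither baseline models the semantic correlation term of HSVC.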
For the visual annotation task, we chose to train concept classifiers using SVM based on the generated codes of the different comparison methods. In our implementation, the liblinear⁵ toolbox was used. For the visual retrieval experiments, all the query samples and the gallery samples were represented using these comparison visual codes, and retrieval was then performed by similarity search. In particular, we used the ℓ1-norm distance for Classeme, DeCAF and our proposed HSVC, and the Hamming distance for PiCodes since its outputs are binary codes.
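The similarity-search step can be sketched as follows (hypothetical helper names; ℓ1 ranking for real-valued codes, Hamming ranking for binary codes):

```python
import numpy as np

def l1_rank(query, gallery):
    """Rank gallery items by l1 distance to the query
    (used for real-valued codes such as Classeme, DeCAF, HSVC)."""
    d = np.abs(gallery - query).sum(axis=1)
    return np.argsort(d)

def hamming_rank(query_bits, gallery_bits):
    """Rank gallery items by Hamming distance
    (used for binary codes such as PiCodes)."""
    d = (gallery_bits != query_bits).sum(axis=1)
    return np.argsort(d)

codes = np.array([[0.0, 1.0], [1.0, 1.0], [0.2, 0.9]])
order = l1_rank(np.array([0.0, 1.0]), codes)
print(order)  # [0 2 1]
```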
For all comparison methods, we tuned the parameters, such as λ in HSVC, Lasso and Group Lasso and C in SVM, in the range {10^−6, 10^−4, . . . , 10^6}.
6.2 Experimental Results I: Visual Annotation

In this part, we report and analyze the experiments on the visual annotation task. We compare HSVC to the other state-of-the-art approaches and show the effects of the semantic properties. In particular, Table 1 shows the overall performance (mAP) of all comparison approaches over all concepts in the NUS-WIDE and CCV datasets. Figure 4 and Figure 5 report the detailed performance (AP) of individual concepts on NUS-WIDE and CCV, respectively.
From the results, we can observe that HSVC consistently achieves the best performance among all comparison coding algorithms. Specifically, on NUS-WIDE, HSVC outperforms the Classeme, PiCodes and DeCAF semantic features by around 16%, 20% and 4%, respectively; and on CCV, the improvements are about 20%, 17% and 6%, respectively. The underlying reasons are twofold. First, the dictionary of HSVC is derived from an existing knowledge base containing rich semantic and hierarchical information, thereby establishing a favourable foundation for effective representation of visual semantics. Second, our proposed HSVC fully explores the semantic properties of visual codes, including semantic sparsity and semantic correlation. Such considerations are rarely reflected in the above semantic features. On the one hand, semantic sparsity of visual data is significant for visual coding since it reveals a genuine pattern of the visual world, i.e., within a visual object, only a small part of the entire semantic space is touched. On the other hand, semantic correlation captures intrinsic relationships (e.g., co-occurrence) among concepts, which exerts mutually reinforcing effects on the quality of the resultant visual codes.
The comparison of the results of Lasso, Group Lasso and HSVC can further shed light on the effectiveness of semantic sparsity and semantic correlation. As seen, HSVC and Group Lasso always perform better than Lasso. Although Lasso applies the ℓ1-norm to induce feature-level sparsity, it can
5 http://www.csie.ntu.edu.tw/~cjlin/liblinear/
Figure 4: Detailed performance (Average Precision) of individual concepts of Classeme, PiCodes, DeCAF and HSVC on visual annotation over the NUS-WIDE dataset.
Figure 5: Detailed performance (Average Precision) of individual concepts of Classeme, PiCodes, DeCAF and HSVC on visual annotation over the CCV dataset.
easily propagate non-zero weights to irrelevant concepts, thereby degrading the semantic purity of the visual codes. Moreover, HSVC consistently obtains superior visual annotation results compared to Group Lasso. A reasonable explanation is that HSVC takes semantic correlation into consideration and thus benefits from the mutual promotion of code quality among correlated concepts.
Figure 6: Overall performance (mAP@K) of visual retrieval using 10% random queries on (a) NUS-WIDE and (b) CCV datasets. K = 1, 2, . . . , 100.
6.3 Experimental Results II: Visual Retrieval

In this part, we report and analyze the experimental results on the visual retrieval task over the NUS-WIDE and CCV datasets. Figure 6(a) and Figure 6(b) give the mAP@K (K = 1, 2, . . . , 100) curves of all comparison approaches over all concepts in the NUS-WIDE and CCV datasets, respectively. Figure 7 and Figure 8 report the detailed performance (AP@20) of individual concepts on NUS-WIDE and CCV, respectively.
Figure 8: Detailed performance (mAP@20) of individual concepts of Classeme, PiCodes, DeCAF and HSVC for visual retrieval on the CCV dataset.
We can observe from the results that HSVC achieves obvious improvements in retrieval performance over all the other comparison methods on both evaluated datasets. For instance, on NUS-WIDE, HSVC improves upon Classeme, PiCodes and DeCAF by around 18%, 17% and 5% in terms of mAP@20; and on CCV, our method outperforms Classeme, PiCodes and DeCAF in terms of mAP@20 by approximately 27%, 28% and 3%, respectively. These performance gains can be reasonably attributed to the deep exploration of semantic aspects in our proposed approach. While the construction of the semantic-enriched dictionary in a hierarchical fashion guarantees a reliable basis for generating effective visual codes, semantic sparsity and semantic correlation provide a better understanding of the essence of the visual world. By naturally connecting the visual and semantic worlds through the proposed semantic coding method, HSVC is able to retrieve more useful results for users.
6.4 Parameter Sensitivity

In this part, we empirically test the parameter sensitivity of our approach. Specifically, we evaluate the effects of λ on visual annotation and visual retrieval over the two evaluated datasets. We tune λ in the range {10^−6, 10^−4, . . . , 10^4, 10^6}. The experimental results are illustrated in Figure 9(a) and Figure 9(b) for NUS-WIDE and CCV, respectively. As we can see, generally speaking, our approach is not obviously sensitive to different values of λ; nonetheless, we may still observe certain variation trends in the results. For instance, as shown in Figure 9(a), when λ gradually grows from 10^−6 to 10^2, we can see an increasing trend in the annotation performance on the NUS-WIDE dataset; as λ keeps
Figure 7: Detailed performance (mAP@20) of individual concepts of Classeme, PiCodes, DeCAF and HSVC for visual retrieval on the NUS-WIDE dataset.
Figure 9: Evaluation of the effects of λ on (a) visual annotation (mAP) and (b) visual retrieval (mAP@20) over the NUS-WIDE and CCV datasets.
increasing to 10^6, the performance starts to drop. This phenomenon suggests that neither a large nor a small value of λ achieves the best performance: when λ is too large, the effect of the hierarchical semantic information in the dictionary may be suppressed, while a small λ leaves semantic sparsity and semantic correlation barely reflected in HSVC.
6.5 Convergence Analysis

As depicted in Section 5, the rigorous mathematical analysis guarantees that Algorithm 1 converges to a globally optimal solution as t → ∞. In this part, we conduct an empirical study of the convergence on both evaluated datasets. Figure 10 illustrates the results on the two datasets; here, we fix the tradeoff parameter λ to 1. As we can see, in experiments, our algorithm converges within only a few iterations, which clearly shows its efficiency for practical use. We believe that our designed iterative updating rule in Eq. (9) can almost always find a fast "path" from the initial point to the global optimum.

Figure 10: Convergence analysis on the two evaluated datasets: (a) NUS-WIDE and (b) CCV. For illustration, we fix λ = 1.

7. CONCLUSION

In this paper, we introduced a novel visual coding approach, termed hierarchical semantic visual coding (HSVC), to effectively encode visual objects in a semantic hierarchy. Following the designed pipeline, we generated a semantic-enriched dictionary hierarchy, which contains dictionaries corresponding to all concepts in a semantic hierarchy as well as their hierarchical semantic information. We also developed an on-line semantic coding model, which jointly exploits the rich hierarchical semantic information in the learned dictionary and considers various semantic properties of visual objects, including semantic sparsity and semantic correlations among concepts in the semantic hierarchy. We proposed to combine a concept-level sparsity constraint and semantic similarity into a unified regularization term. An effective algorithm was devised for solving the optimization problem in the proposed model, and a rigorous mathematical analysis has been provided to guarantee that the algorithm converges to the global optimum. Extensive experiments have been conducted to show the effectiveness of our approach.
Although HSVC has achieved superior performance compared to several baselines and state-of-the-art methods, it still leaves an open problem to be further addressed: the representative power of the learned dictionary relies heavily on an existing hand-crafted semantic hierarchy, which may not be suitable for the constantly evolving visual world. In the future, we will investigate an automatic way of discovering representative concepts and building the semantic dictionary from both visual and textual data, such that visual semantics can be identified in a more precise manner.
8. REFERENCES

[1] S. Bengio, F. Pereira, Y. Singer, and D. Strelow. Group sparse coding. In NIPS, pages 82–89, 2009.

[2] A. Bergamo, L. Torresani, and A. W. Fitzgibbon. Picodes: Learning a compact code for novel-category recognition. In NIPS, pages 2088–2096, 2011.
[3] A. Binder, K.-R. Müller, and M. Kawanabe. On taxonomies for multi-class image categorization. IJCV, 99(3):281–301, 2012.
[4] J. Deng, A. C. Berg, and L. Fei-Fei. Hierarchical semanticindexing for large scale image retrieval. In CVPR, pages785–792, 2011.
Figure 11: Illustration of retrieval examples using our proposed HSVC on NUS-WIDE and CCV. The top two rows are results on NUS-WIDE while the bottom two are from CCV.
[5] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What doesclassifying more than 10,000 image categories tell us? InECCV, pages 71–84. 2010.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, andL. Fei-Fei. Imagenet: A large-scale hierarchical imagedatabase. In CVPR, pages 248–255. IEEE, 2009.
[7] T. Deselaers and V. Ferrari. Visual and semantic similarityin imagenet. In CVPR, pages 1777–1784, 2011.
[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang,E. Tzeng, and T. Darrell. Decaf: A deep convolutionalactivation feature for generic visual recognition. arXivpreprint arXiv:1310.1531, 2013.
[9] C. Elkan. Using the triangle inequality to acceleratek-means. In ICML, volume 3, pages 147–153, 2003.
[10] C. Fellbaum. Wordnet. Theory and Applications ofOntology: Computer Applications, 2010.
[11] S. Gao, L.-T. Chia, and I. W. Tsang. Multi-layer groupsparse coding – for concurrent image classification andannotation. In CVPR, pages 2809–2816, 2011.
[12] G. Griffin and P. Perona. Learning and using taxonomiesfor fast visual categorization. In CVPR, pages 1–8, 2008.
[13] J. Huang, H. Liu, J. Shen, and S. Yan. Towards efficientsparse coding for scalable image annotation. In ACMMultimedia, pages 947–956. ACM, 2013.
[14] Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. C. Loui.Consumer video understanding: A benchmark databaseand an evaluation of human and machine performance. InICMR, page 29, 2011.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetclassification with deep convolutional neural networks. InNIPS, pages 1097–1105, 2012.
[16] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags offeatures: Spatial pyramid matching for recognizing naturalscene categories. In CVPR, volume 2, pages 2169–2178,2006.
[17] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparsecoding algorithms. In NIPS, pages 801–808, 2006.
[18] M. Marszałek and C. Schmid. Semantic hierarchies for visual object recognition. In CVPR, pages 1–7, 2007.
[19] L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. JRSS-B, 70(1):53–71, 2008.
[20] M. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu,L. Kennedy, A. Hauptmann, and J. Curtis. Large-scale
concept ontology for multimedia. IEEE Multimedia,13(3):86–91, 2006.
[21] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, andB. Schiele. What helps where–and why? semanticrelatedness for knowledge transfer. In CVPR, pages910–917, 2010.
[22] L. Shao, R. Yan, X. Li, and Y. Liu. From heuristicoptimization to dictionary learning: A review andcomprehensive comparison of image denoising algorithms.TCYB, 44(7):1001–1013, 2014.
[23] Y. Song, M. Zhao, J. Yagnik, and X. Wu. Taxonomicclassification for web-based videos. In CVPR, pages871–878, 2010.
[24] Y. Sun, X. Wang, and X. Tang. Deep convolutionalnetwork cascade for facial point detection. In CVPR, pages3476–3483, 2013.
[25] J. Tang, L. Shao, and X. Li. Efficient dictionary learningfor visual categorization. CVIU, 124:91–98, 2014.
[26] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficientobject category recognition using classemes. In ECCV,pages 776–789. 2010.
[27] N. Verma, D. Mahajan, S. Sellamanickam, and V. Nair.Learning hierarchical similarity metrics. In CVPR, pages2280–2287, 2012.
[28] Y. Yang, Y. Yang, Z. Huang, H. T. Shen, and F. Nie. Taglocalization with spatial correlations and joint groupsparsity. In CVPR, pages 881–888, 2011.
[29] Y. Yang, Y. Yang, and H. T. Shen. Effective transfertagging from image to video. TOMM, 9(2):14, 2013.
[30] Y. Yang, Z.-J. Zha, Y. Gao, X. Zhu, and T.-S. Chua.Exploiting web images for semantic video indexing viarobust sample-specific loss. TMM, 16(6):1677–1689, 2014.
[31] M. Yuan and Y. Lin. Model selection and estimation inregression with grouped variables. JRSS-B, 68(1):49–67,2006.
[32] H. Zhang, Z.-J. Zha, Y. Yang, S. Yan, Y. Gao, and T.-S.Chua. Attribute-augmented semantic hierarchy: towardsbridging semantic gap and intention gap in image retrieval.In ACM Multimedia, pages 33–42, 2013.
[33] S. Zhang, J. Huang, H. Li, and D. N. Metaxas. Automaticimage annotation and retrieval using group sparsity.TCYB, 42(3):838–849, 2012.
[34] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, andD. Cai. Graph regularized sparse coding for imagerepresentation. TIP, 20(5):1327–1336, 2011.