Multi-Task Consistency-Preserving Adversarial Hashing for Cross-Modal Retrieval

De Xie, Cheng Deng, Member, IEEE, Chao Li, Xianglong Liu, and Dacheng Tao, Fellow, IEEE

Abstract— Owing to the advantages of low storage cost and high query efficiency, cross-modal hashing has received increasing attention recently. Because they fail to bridge the inherent modality gap, most existing cross-modal hashing methods have limited capability to explore the semantic consistency information between data of different modalities, leading to unsatisfactory search performance. To address this problem, we propose a novel deep hashing method named Multi-Task Consistency-Preserving Adversarial Hashing (CPAH) to fully explore the semantic consistency and correlation between different modalities for efficient cross-modal retrieval. First, we design a consistency refined module (CR) to divide the representation of each modality into two irrelevant parts, i.e., modality-common and modality-private representations. Then, a multi-task adversarial learning module (MA) is presented, which can make the modality-common representations of different modalities close to each other in terms of feature distribution and semantic consistency. Finally, compact and powerful hash codes can be generated from the modality-common representation. Comprehensive evaluations conducted on three representative cross-modal benchmark datasets illustrate that our method is superior to the state-of-the-art cross-modal hashing methods.

Index Terms— Cross-modal retrieval, hashing, consistency-preserving, adversarial, multi-task.

I. INTRODUCTION

WITH the explosive development of the Internet and big data, large-scale, high-dimensional multi-modal data has become pervasive in social networks and storage media. To deal with such massive data, cross-modal retrieval, which aims to search semantically similar instances in one modality using a query from another modality, has become increasingly significant and a fundamental subject in computer vision applications [1], [2].

Manuscript received April 11, 2019; revised October 17, 2019; accepted December 30, 2019. Date of publication January 9, 2020; date of current version January 30, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61572388 and Grant 61703327, in part by the Key R&D Program-The Key Industry Innovation Chain of Shaanxi under Grant 2018ZDXM-GY-176 and Grant 2019ZDLGY03-02-01, and in part by the National Key R&D Program of China under Grant 2017YFE0104100 and Grant 2016YFE0200400. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Giulia Boato. (Corresponding authors: Cheng Deng; Xianglong Liu.)

De Xie, Cheng Deng, and Chao Li are with the School of Electronic Engineering, Xidian University, Xi’an 710071, China (e-mail: [email protected]; [email protected]; [email protected]).

Xianglong Liu is with the State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China (e-mail: [email protected]).

Dacheng Tao is with the UBTECH Sydney Artificial Intelligence Centre and the School of Computer Science, Faculty of Engineering, University of Sydney, Darlington, NSW 2008, Australia (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIP.2020.2963957

Actually, because the data structures and feature distributions of different modalities are inconsistent, the semantic consistency between a pair of relevant instances is difficult to capture. How to effectively build semantic consistency between different modalities remains a challenging issue.

Most existing cross-modal retrieval methods, including traditional statistical correlation [3], graph regularization [4], and dictionary learning [5], follow the vein of subspace learning: they map data of different modalities into a common subspace and measure similarities in this space. However, these methods often suffer from high computation cost and low search accuracy. To tackle these drawbacks, hashing-based methods [6]–[10] project high-dimensional data from each modality into compact hash codes and preserve similar instances with similar hash codes. As such, the similarity can be computed via fast bitwise XOR operations in Hamming space, which saves memory storage cost and speeds up query processing.
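To make the efficiency argument concrete, the following minimal NumPy sketch (illustrative only, not part of the paper) shows how the Hamming distance between two binary codes reduces to a bitwise XOR followed by a population count; the byte packing is an implementation choice for the example.

```python
import numpy as np

def hamming_distance(code_a, code_b):
    """Hamming distance between two {0,1} hash codes via XOR + popcount."""
    packed_a = np.packbits(code_a.astype(np.uint8))   # pack bits into bytes
    packed_b = np.packbits(code_b.astype(np.uint8))
    xor = np.bitwise_xor(packed_a, packed_b)          # differing bits are set to 1
    return int(np.unpackbits(xor).sum())              # popcount = Hamming distance

# Example: two 16-bit codes differing in 3 positions.
a = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
b = a.copy(); b[[2, 7, 10]] ^= 1
print(hamming_distance(a, b))  # 3
```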

Traditional cross-modal hashing methods mostly rely on shallow models, which can be divided into unsupervised [11], [12] and supervised settings [5], [13]–[15]. Compared with their unsupervised counterparts, supervised cross-modal hashing methods can achieve better performance by exploiting semantic labels to build cross-modal correlation. Unfortunately, these shallow cross-modal hashing methods depend on hand-crafted features, which greatly limit the discriminative representation of instances and thus degrade search performance. To take advantage of the recent progress of deep learning [16]–[18], deep cross-modal hashing methods have been proposed, as they are able to learn more discriminative representations and thus capture cross-modal correlations more effectively [19]–[27]. However, almost all of these deep models fail to distinguish modality-common and modality-private representations, making the learning of semantic consistency between modalities inaccurate. Domain separation networks (DSN) [28] design an encoder-decoder architecture to extract image representations that contain a common subspace across domains and a private subspace for each domain. Inspired by this common-private component analysis, SPDQ [29] projects multi-modal instances into a shared subspace and two private subspaces to learn modality-common and modality-private representations via multiple kernel maximum mean discrepancy (MK-MMD).


ADAH [30] proposes an attention-aware method to generate attention masks for extracting modality-common and modality-private representations simultaneously. In fact, the modality-common and modality-private representations are entangled with each other and are difficult to separate with these models. More explicit guidance for disentangling modality-common and modality-private representations is necessary.

In this paper, we propose a novel deep architecture for cross-modal hashing retrieval, named multi-task consistency-preserving adversarial hashing (CPAH), where the modality-common representation, modality-private representation, and compact hash codes for different modalities are learned in a unified framework. Specifically, we first design a consistency refined module (CR) to divide each modality representation into a modality-common and a modality-private representation, which are irrelevant to each other. Subsequently, in order to bridge the modality gap and capture semantic consistency between different modalities effectively, we propose a multi-task adversarial learning module (MA), which, in an adversarial way, makes the feature distributions of the modality-common representations from different modalities closer to each other and captures semantic consistency information between different modalities more effectively. Finally, we exploit each modality-common representation to generate discriminative and compact hash codes for the cross-modal retrieval task. The main contributions of the proposed CPAH can be summarized as follows:

• We propose a multi-task consistency-preserving adversarial hashing method for cross-modal retrieval, where multi-modal semantic consistency learning and hash learning are seamlessly incorporated into an end-to-end framework.

• We design a consistency refined module (CR) for modality representation separation, which can divide each modality representation into two incompatible parts, i.e., modality-common and modality-private representations. Moreover, we propose a multi-task adversarial learning module (MA), which can effectively preserve the semantic consistency information between different modalities through an adversarial learning strategy.

• Extensive experiments conducted on three benchmark datasets demonstrate that the proposed CPAH significantly outperforms the state-of-the-art cross-modal hashing methods, including both traditional and deep learning based methods.

The rest of this paper is organized as follows. We introduce related work pertaining to cross-modal hashing and adversarial learning in Section II. Section III presents our proposed CPAH and its optimization with theoretical analysis in detail. Section IV illustrates the experimental results and analysis. Finally, Section V concludes our work.

II. RELATED WORK

A. Cross-Modal Hashing

Cross-modal hashing can be categorized into two groups: unsupervised and supervised methods. The unsupervised methods only use co-occurrence information to learn hash functions for multi-modal data. For example, CVH [31] extends spectral

hashing from uni-modal to multi-modal scenarios. LSSH [32] jointly learns latent features from images and texts with sparse coding. On the other hand, the supervised methods exploit label information to learn more discriminative common representations. CMSSH [33] regards each hash function as a binary classification problem and uses a boosting algorithm in the learning process. SCM [34] utilizes non-negative matrix factorization and a neighbor preserving algorithm to maintain inter-modal and intra-modal semantic correlations.

Recently, deep learning based cross-modal hashing has shown a powerful ability to exploit nonlinear correlations across different modalities. DCMH [20] combines feature learning and hash learning into a unified framework. PRDH [21] also adopts deep CNN models to learn feature representations and hash codes simultaneously. DVSH [35] utilizes a convolutional neural network (CNN) and long short-term memory (LSTM) to separately learn the common representations. TDH [24] adopts deep neural networks with triplet labels to capture more general semantic correlations between modalities.

B. Adversarial Learning

With the evolution of generative adversarial networks (GANs) [36], adversarial learning has received increasing attention and achieved considerable progress in image generation [37], [38] and representation learning [39], [40]. Meanwhile, adversarial learning has also been employed in retrieval tasks. HashGAN [41] is a typical adversarial hash-based retrieval method, which can learn compact binary hash codes from both real images and diverse images synthesized by generative models. Nevertheless, HashGAN addresses uni-modal retrieval and only utilizes generative adversarial learning as a means of data augmentation. In contrast, for real-valued cross-modal retrieval, ACMR [42] adopts adversarial learning to discover an effective common subspace and generate modality-invariant representations. Li et al. [22] proposed self-supervised adversarial hashing (SSAH), which leverages two adversarial networks to capture the semantic correlation between different modalities. Zhang et al. [30] presented attention-aware deep adversarial hashing (ADAH) to generate an attention mask for learning attended and unattended feature representations. In comparison with these methods, our CPAH utilizes multi-task adversarial learning to explicitly model the learning of modality-common and modality-private representations, which can maximally preserve the semantic consistency between modalities.

III. MULTI-TASK CONSISTENCY-PRESERVING ADVERSARIAL HASHING

In this section, we present the details about our CPAH model, including model formulation and optimization algorithm. Fig. 1 shows the whole flowchart of CPAH, which is an end-to-end framework with a couple of networks and contains two main steps, i.e., consistency learning and hash learning.

A. Problem Formulation

The notation and the problem formulation are first introduced.


Fig. 1. The framework of the proposed CPAH. CPAH is an end-to-end framework with a couple of networks, i.e., image network and text network, and contains two main steps. Step 1 is consistency learning, which can extract semantic consistency information through the consistency refined module (CR) and the multi-task adversarial learning module (MA). Step 2 is hash learning, which utilizes the modality-common representation to yield compact hash codes.

Our CPAH can be extended to multiple modalities, such as image, audio, and video. For simplicity, we only use image and text to explain our method. Throughout, matrices and vectors are written in boldface uppercase and lowercase letters, respectively. To denote the $i$-th element of a vector $\mathbf{m}$, we use the notation $m_i$. For a matrix $\mathbf{M}$, $M_{ij}$ denotes the element in the $i$-th row and $j$-th column. Furthermore, the superscript $V/T$ of a variable indicates that it belongs to the image/text modality.

Given a dataset $O = \{o_i\}_{i=1}^{N}$ including image and text modalities, $o_i = \{v_i, t_i, l_i\}$, where $v_i$ and $t_i$ are the raw image and text data of the $i$-th instance, and $l_i = \{l_{i1}, l_{i2}, \ldots, l_{iL}\}$ is the multi-label annotation assigned to $o_i$, with $L$ being the number of classes. If $o_i$ belongs to the $j$-th class, $l_{ij} = 1$; otherwise $l_{ij} = 0$. The pairwise multi-label similarity matrix $\mathbf{S}$ is used to describe the semantic similarity between any two instances, where $S_{ij} = 1$ means $o_i$ is semantically similar to $o_j$, and otherwise $S_{ij} = 0$. In the multi-label setting, two instances $o_i$ and $o_j$ are annotated with multiple labels. Thus, we define $S_{ij} = 1$ if $o_i$ and $o_j$ share at least one label, and otherwise $S_{ij} = 0$.

For an image (text) query, the goal of cross-modal hashing is to learn two hash functions for the two modalities: $\mathbf{B}^V \in \{-1,+1\}^K$ for the image modality and $\mathbf{B}^T \in \{-1,+1\}^K$ for the text modality, where $K$ is the length of the hash code. The similarity between two hash codes is measured by the Hamming distance. The relationship between the Hamming distance $dis_H(\mathbf{B}^V_i, \mathbf{B}^T_j)$ and the inner product $\langle \mathbf{B}^V_i, \mathbf{B}^T_j \rangle$ can be formulated as:

$$dis_H(\mathbf{B}^V_i, \mathbf{B}^T_j) = \frac{1}{2}\big(K - \langle \mathbf{B}^V_i, \mathbf{B}^T_j \rangle\big), \tag{1}$$

thus we can use the inner product to evaluate the similarity of two hash codes. Given $\mathbf{S}$, the probability of $\mathbf{S}$ conditioned on $\mathbf{B}^{*}$ ($* = V, T$) can be expressed as:

$$p(S_{ij} \mid \mathbf{B}^{*}) = \begin{cases} \sigma(\theta_{ij}), & S_{ij} = 1 \\ 1 - \sigma(\theta_{ij}), & S_{ij} = 0 \end{cases} \tag{2}$$

where $\sigma(\theta_{ij}) = \frac{1}{1 + e^{-\theta_{ij}}}$ and $\theta_{ij} = \frac{1}{2}\langle \mathbf{B}^V_i, \mathbf{B}^T_j \rangle$. Therefore, two instances with a larger inner product should be similar with high probability. In this way, the problem of quantifying the similarity between binary codes in Hamming space is transformed into computing the inner product of their original features.
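As an illustration of this formulation (a toy sketch, not from the paper), the NumPy snippet below builds the similarity matrix S from multi-hot labels and checks the relation in Eq. (1) between Hamming distance and inner product for ±1 codes; the array sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, K = 5, 4, 16                                 # instances, labels, code length (toy values)

labels = rng.integers(0, 2, size=(N, L))           # multi-hot label matrix
S = (labels @ labels.T > 0).astype(int)            # S_ij = 1 iff at least one shared label

B_V = np.where(rng.standard_normal((N, K)) >= 0, 1, -1)   # stand-in image codes in {-1,+1}
B_T = np.where(rng.standard_normal((N, K)) >= 0, 1, -1)   # stand-in text codes in {-1,+1}

i, j = 0, 1
inner = B_V[i] @ B_T[j]
dist_hamming = np.sum(B_V[i] != B_T[j])
assert dist_hamming == 0.5 * (K - inner)           # Eq. (1)

theta = 0.5 * inner
p_similar = 1.0 / (1.0 + np.exp(-theta))           # sigma(theta_ij) in Eq. (2)
print(S[i, j], dist_hamming, round(float(p_similar), 3))
```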

B. Framework

As shown in Fig. 1, we exploit two deep neural networks, i.e., an image network and a text network, for the image modality $V$ and the textual modality $T$, respectively. For the image network, we extract the image representation via a universal convolutional neural network such as CNN-F [43], which is a popular backbone in cross-modal hashing. The original CNN-F [43] consists of five convolutional layers (conv1-conv5) and three fully-connected layers (fc6-fc8). We utilize the output of fc7 as the representation of the image modality. Then a consistency refined module (CR) is designed for the image modality to generate two incompatible representations, i.e., the modality-common representation and the image modality-private representation. For the text network, we first transform each text instance into a one-hot vector using the bag-of-words (BOW) representation. The BOW representation is fed into two convolutional layers (conv1-conv2) to extract the text representation. Then, we also apply a consistency refined module (CR) to divide the text representation into two incompatible representations, i.e., the modality-common representation and the textual modality-private representation. After that, we design a multi-task adversarial learning module (MA), which contains a semantic consistency discriminator, a semantic consistency classifier, and a semantic classifier. These three components cooperate with each other in an adversarial way, which can generate a more effective modality-common representation. Finally, to map the learned modality-common representation into Hamming space directly, we design two fully-connected hash layers, $fc^V_h$ and $fc^T_h$, with $K$ hidden nodes, as in [24], to yield compact binary codes for cross-modal hashing retrieval.

1) Consistency Refined Module: In the cross-modal retrieval setting, the essential problem is to find the semantic consistency information between instances of different modalities. In fact, for a pair of relevant instances, each instance not only carries the common semantic consistency information but also its own private information, e.g., the semantically irrelevant background information in images. Although these two kinds of information are mutually exclusive, their extraction processes can promote each other. Therefore, for the image and text modalities, we design a consistency refined module to divide the image and text representations into image/text modality-common and image/text modality-private representations. Suppose $r^V$ and $r^T$ are the image and text representations; we input them into two different convolutional layers $f^V_{cr}$ and $f^T_{cr}$, whose kernel size is $1 \times 1$ and whose activation function is the sigmoid, to generate an information selection mask for each modality. Let $m^V_c$, $m^V_p$ and $m^T_c$, $m^T_p$ be the information selection masks of the image and text modalities, respectively, which can be formulated as:

$$m^V_c = \mathrm{sigmoid}\big(f^V_{cr}(r^V)\big), \quad m^V_p = 1 - \mathrm{sigmoid}\big(f^V_{cr}(r^V)\big), \quad m^T_c = \mathrm{sigmoid}\big(f^T_{cr}(r^T)\big), \quad m^T_p = 1 - \mathrm{sigmoid}\big(f^T_{cr}(r^T)\big). \tag{3}$$

After that, the modality-common representations $r^V_c$, $r^T_c$ and the modality-private representations $r^V_p$, $r^T_p$ can be obtained through these masks as:

$$r^V_c = r^V \cdot m^V_c, \quad r^V_p = r^V \cdot m^V_p, \quad r^T_c = r^T \cdot m^T_c, \quad r^T_p = r^T \cdot m^T_p. \tag{4}$$

Since the modality-common representation and the modality-private representation are mutually exclusive, learning them simultaneously is conducive to obtaining a more discriminative modality-common representation.
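The gating performed by the CR module can be sketched as follows (an illustrative reconstruction, not the authors' code): on an fc7-style feature vector, the 1 × 1 convolution is stood in by a dense linear map followed by a sigmoid, and the complementary masks split the feature into common and private parts; the sizes and random weights are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def consistency_refine(r, W, b):
    """Split a modality representation r into common/private parts (Eqs. (3)-(4)).

    r: feature vector of size D (e.g., an fc7 output).
    W, b: parameters playing the role of the 1x1 conv f_cr of the CR module.
    """
    m_common = sigmoid(W @ r + b)          # information selection mask, Eq. (3)
    m_private = 1.0 - m_common             # complementary mask
    r_common = r * m_common                # modality-common representation, Eq. (4)
    r_private = r * m_private              # modality-private representation
    return r_common, r_private

rng = np.random.default_rng(1)
D = 8                                      # toy feature size (a real fc7 output would be 4096)
r_v = rng.standard_normal(D)               # stand-in image representation
W, b = rng.standard_normal((D, D)) * 0.1, np.zeros(D)
r_vc, r_vp = consistency_refine(r_v, W, b)
assert np.allclose(r_vc + r_vp, r_v)       # the two parts sum back to the original feature
```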

2) Multi-Task Adversarial Learning: To learn modality-common and modality-private representations effectively, we design a multi-task adversarial learning module that contains three tasks: consistency adversarial learning, information classification, and semantic classification.

Adversarial learning can be regarded as a distribution approximator, which is able to make the distribution of the generated data similar to that of the ground truth. Therefore, we design a consistency adversarial learning task, aiming to make the distributions of the modality-common representations close to each other, which contains two steps: a discriminative step and a generative step. Let $\Theta^V_c$, $\Theta^T_c$, $\Theta^V_p$, and $\Theta^T_p$ be the parameters of the networks $G^V_c$, $G^T_c$, $G^V_p$, and $G^T_p$, respectively, which generate the modality-common representations $r^V_c$, $r^T_c$ and the modality-private representations $r^V_p$, $r^T_p$. $\Theta_D$ is the parameter of the discriminator $D$. In the discriminative step, we train $D$ to classify the image modality-common representation as “True” and the text modality-common representation as “Fake”. The objective function of the discriminator is defined as:

$$\min_{\Theta_D} \mathcal{L}^d_{cal} = \sum_{i=1}^{N} \left\| D(r^V_{ci}) - 1 \right\|^2 + \left\| D(r^T_{ci}) - 0 \right\|^2, \tag{5}$$

where $r^V_{ci}$ and $r^T_{ci}$ are the modality-common representations of the $i$-th pairwise instances of the image and text modalities, $v_i$ and $t_i$, respectively. On the contrary, in the generative step, we train $G^V_c$ and $G^T_c$ to fool the discriminator, i.e., to have the image modality-common representation classified as “Fake” and the text modality-common representation classified as “True”. The objective function of the generators is defined as:

$$\min_{\Theta^V_c, \Theta^T_c} \mathcal{L}^g_{cal} = \sum_{i=1}^{N} \left\| D\big(G^V_c(v_i)\big) - 0 \right\|^2 + \left\| D\big(G^T_c(t_i)\big) - 1 \right\|^2. \tag{6}$$

The generators $G^V_c$, $G^T_c$ and the discriminator $D$ are trained against each other with an adversarial learning strategy. With this strategy, the distributions of the modality-common representations $r^V_c$ and $r^T_c$ gradually approach each other.
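A minimal NumPy rendering of Eqs. (5)-(6) is given below (an illustrative sketch, not the paper's TensorFlow code); the discriminator D is stood in by a fixed linear scorer with a sigmoid, and the alternating gradient updates on the two objectives are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator(r, w):
    """Stand-in discriminator D: a linear scorer squashed to (0, 1)."""
    return sigmoid(r @ w)

def d_loss(r_vc, r_tc, w):
    """Eq. (5): D should output 1 on image-common and 0 on text-common features."""
    return np.sum((discriminator(r_vc, w) - 1.0) ** 2) + \
           np.sum((discriminator(r_tc, w) - 0.0) ** 2)

def g_loss(r_vc, r_tc, w):
    """Eq. (6): the generators try to flip D's decisions."""
    return np.sum((discriminator(r_vc, w) - 0.0) ** 2) + \
           np.sum((discriminator(r_tc, w) - 1.0) ** 2)

rng = np.random.default_rng(2)
N, D_feat = 4, 8
r_vc = rng.standard_normal((N, D_feat))   # image modality-common representations
r_tc = rng.standard_normal((N, D_feat))   # text modality-common representations
w = rng.standard_normal(D_feat)
print(d_loss(r_vc, r_tc, w), g_loss(r_vc, r_tc, w))
```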

To effectively separate modality-common and modality-private representations, we apply information classification to learn the modality-common representation, the image modality-private representation, and the text modality-private representation in a weakly supervised manner. Specifically, we design an information classifier $C$ and regard information classification as a multi-class classification task. The label of each kind of representation is a one-hot vector $\gamma \in \mathbb{R}^3$: the image modality-private representation label $\gamma^V_p$ is “0”, the modality-common representation labels $\gamma^V_c$ and $\gamma^T_c$ are “1”, and the text modality-private representation label $\gamma^T_p$ is “2”. The objective function of information classification is expressed as follows:

$$\min_{\Theta^V_c, \Theta^T_c, \Theta^V_p, \Theta^T_p, \Theta_C} \mathcal{L}_{ic} = -\sum_{i=1}^{N} \gamma \cdot \log\big(C(r_i)\big), \tag{7}$$

where $\gamma \in \{\gamma^V_p, \gamma^V_c, \gamma^T_c, \gamma^T_p\}$, $r_i \in \{r^V_{pi}, r^V_{ci}, r^T_{ci}, r^T_{pi}\}$, and $\Theta_C$ is the parameter of $C$.
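The information classification loss of Eq. (7) is a standard softmax cross-entropy over the three representation types; the sketch below (illustrative only, with a random linear classifier standing in for C) evaluates one such term.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                        # numerical stability
    e = np.exp(z)
    return e / e.sum()

def info_class_loss(r, gamma, W_c):
    """One term of Eq. (7): cross-entropy between one-hot gamma and C(r)."""
    probs = softmax(W_c @ r)               # stand-in classifier C with 3 outputs
    return -float(gamma @ np.log(probs + 1e-12))

rng = np.random.default_rng(3)
D_feat = 8
W_c = rng.standard_normal((3, D_feat))     # classes: 0 image-private, 1 common, 2 text-private
r_vp = rng.standard_normal(D_feat)         # an image modality-private representation
gamma_vp = np.array([1.0, 0.0, 0.0])       # its one-hot label ("0")
print(info_class_loss(r_vp, gamma_vp, W_c))
```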

We also adopt semantic classification to preserve the semantically discriminative capability of the image and text modality-common representations. To this end, two semantic classifier layers $fc^V_{sc}$ and $fc^T_{sc}$ are designed to capture fine-grained cross-modal semantic consistency.


The objective function of the semantic classification can be written as:

$$\min_{\Theta^V_c, \Theta^T_c, \Theta^V_{sc}, \Theta^T_{sc}} \mathcal{L}_{sc} = -\sum_{i=1}^{N} \Big[ l_i \cdot \log\big(fc^V_{sc}(r^V_{ci})\big) + l_i \cdot \log\big(fc^T_{sc}(r^T_{ci})\big) + (1 - l_i) \cdot \log\big(1 - fc^V_{sc}(r^V_{ci})\big) + (1 - l_i) \cdot \log\big(1 - fc^T_{sc}(r^T_{ci})\big) \Big], \tag{8}$$

where $l_i$ is the label vector of the $i$-th pairwise instances $v_i$ and $t_i$, and $\Theta^V_{sc}$ and $\Theta^T_{sc}$ are the parameters of $fc^V_{sc}$ and $fc^T_{sc}$, respectively. Combining Eq. (5), Eq. (6), Eq. (7), and Eq. (8), we obtain the objective of multi-task adversarial learning as:

$$\mathcal{L}_{mtal} = \mathcal{L}^d_{cal} + \mathcal{L}^g_{cal} + \alpha(\mathcal{L}_{ic} + \mathcal{L}_{sc}), \tag{9}$$

where $\alpha$ is a hyper-parameter. Through these three tasks, we can disentangle the modality-common and modality-private representations and further obtain a pair of uniformly distributed modality-common representations that contain effective semantic consistency information.
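For completeness, a toy NumPy version of the multi-label semantic classification loss in Eq. (8) and the combined multi-task objective of Eq. (9) is shown below; the sigmoid classifier weights and the already-computed adversarial/information loss values are placeholder assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantic_loss(r_vc, r_tc, labels, W_v, W_t, eps=1e-12):
    """Eq. (8): multi-label binary cross-entropy on both common representations."""
    p_v = sigmoid(r_vc @ W_v)              # per-label probabilities from the image branch
    p_t = sigmoid(r_tc @ W_t)              # per-label probabilities from the text branch
    bce = lambda p: labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps)
    return -float(np.sum(bce(p_v) + bce(p_t)))

rng = np.random.default_rng(4)
N, D_feat, L = 4, 8, 5
r_vc = rng.standard_normal((N, D_feat))
r_tc = rng.standard_normal((N, D_feat))
labels = rng.integers(0, 2, size=(N, L)).astype(float)
W_v, W_t = rng.standard_normal((D_feat, L)), rng.standard_normal((D_feat, L))

L_sc = semantic_loss(r_vc, r_tc, labels, W_v, W_t)
L_d, L_g, L_ic, alpha = 1.0, 1.0, 1.0, 0.1     # placeholder loss values
L_mtal = L_d + L_g + alpha * (L_ic + L_sc)     # Eq. (9)
print(L_sc, L_mtal)
```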

3) Hash Learning: After multi-task adversarial learning, compact and discriminative hash codes for image and text can be generated from the modality-common representations $r^V_c$ and $r^T_c$. To this end, we design two hash layers $f^V_h$ and $f^T_h$ for image and text hash code generation. We utilize a pairwise loss function to measure the similarity between the hash codes, which is devised as:

$$\min_{\Theta^V_c, \Theta^T_c, \Theta^V_h, \Theta^T_h} \mathcal{L}_p = -\sum_{i,j=1}^{N} \big( S_{ij}\theta_{ij} - \log(1 + e^{\theta_{ij}}) \big), \tag{10}$$

where $h^V = f^V_h(r^V_c)$, $h^T = f^T_h(r^T_c)$, $\theta_{ij} = \frac{1}{2}\langle h^V_i, h^T_j \rangle$, and $\Theta^V_h$ and $\Theta^T_h$ are the parameters of the hash layers $f^V_h$ and $f^T_h$, respectively. It is easy to verify that minimizing this loss function is equivalent to maximizing the likelihood, which makes the similarity (inner product) between $h^V_i$ and $h^T_j$ large when $S_{ij} = 1$ and small when $S_{ij} = 0$. Hence, optimizing $\mathcal{L}_p$ preserves the cross-modal similarity in $\mathbf{S}$ with the image hashing layer output $h^V$ and the text hashing layer output $h^T$. Furthermore, since hash codes are discrete, quantization error is unavoidable in the hash code learning procedure. In order to reduce the quantization error, the quantization loss is designed as:

$$\min_{\Theta^V_c, \Theta^T_c, \Theta^V_h, \Theta^T_h} \mathcal{L}_q = \| \mathbf{b}^V - h^V \|_F^2 + \| \mathbf{b}^T - h^T \|_F^2 = \| \mathbf{b} - h^V \|_F^2 + \| \mathbf{b} - h^T \|_F^2 \quad \text{s.t. } \mathbf{b} \in \{-1,+1\}^{K \times N}, \tag{11}$$

where $\mathbf{b}^V = \mathrm{sign}(h^V)$ and $\mathbf{b}^T = \mathrm{sign}(h^T)$. We let the two modalities share the hash codes $\mathbf{b}$ during training and regard $h^V$ and $h^T$ as continuous surrogates of $\mathbf{b}$. Because $h^V$ and $h^T$ preserve the cross-modal similarity in $\mathbf{S}$, the binary hash codes $\mathbf{b}^V$ and $\mathbf{b}^T$ can also be expected to preserve the cross-modal similarity in $\mathbf{S}$, which exactly matches the goal of cross-modal hashing. Combining Eq. (10) with Eq. (11), we obtain the objective function of hash learning as follows:

$$\mathcal{L}_{hash} = \mathcal{L}_p + \beta \mathcal{L}_q, \tag{12}$$

where $\beta$ is a hyper-parameter.
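The hash-learning objective can be rendered in a few lines of NumPy (an illustrative sketch, not the authors' implementation); the shared code b is taken here as sign(h_v + h_t), one common heuristic for realizing the shared-code constraint, and all array contents are random placeholders.

```python
import numpy as np

def hash_losses(h_v, h_t, S, beta):
    """Pairwise likelihood loss (Eq. (10)) plus quantization loss (Eq. (11))."""
    theta = 0.5 * (h_v @ h_t.T)                          # theta_ij = 0.5 <h^V_i, h^T_j>
    L_p = -np.sum(S * theta - np.log1p(np.exp(theta)))   # negative log-likelihood
    b = np.sign(h_v + h_t)                               # shared binary codes (heuristic choice)
    b[b == 0] = 1
    L_q = np.sum((b - h_v) ** 2) + np.sum((b - h_t) ** 2)
    return L_p + beta * L_q                              # Eq. (12)

rng = np.random.default_rng(5)
N, K = 6, 16
h_v = rng.standard_normal((N, K))                        # image hash-layer outputs
h_t = rng.standard_normal((N, K))                        # text hash-layer outputs
S = rng.integers(0, 2, size=(N, N)).astype(float)        # toy similarity matrix
print(hash_losses(h_v, h_t, S, beta=1.0))
```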

Algorithm 1 The Optimization Algorithm for Our CPAH

4) Objective Function: By merging the above two parts together (i.e., the multi-task adversarial learning loss $\mathcal{L}_{mtal}$ and the hash learning loss $\mathcal{L}_{hash}$), we obtain the overall objective function as follows:

$$\mathcal{L} = \mathcal{L}_{mtal} + \mathcal{L}_{hash} = \mathcal{L}^d_{cal} + \mathcal{L}^g_{cal} + \mathcal{L}_p + \alpha(\mathcal{L}_{ic} + \mathcal{L}_{sc}) + \beta\mathcal{L}_q. \tag{13}$$

In order to yield powerful hash codes, we minimize Eq. (13) in an alternating manner.

C. Learning Algorithm

Since the objective function is not convex, we adopt an alternating learning strategy to train our CPAH model. Each time we learn one group of parameters while the others are fixed. The model is optimized iteratively until the parameters converge or the preset maximum number of epochs is reached. The whole alternating learning algorithm is briefly outlined in Algorithm 1.

1) Multi-Task Adversarial Learning: First, the discriminator $D$ is trained with Eq. (5). The parameter $\Theta_D$ is updated while $\Theta^V_c$ and $\Theta^T_c$ are fixed:

$$\Theta_D = \arg\min_{\Theta_D} \mathcal{L}^d_{cal}. \tag{14}$$

Then, $G^V_c$, $G^T_c$, $G^V_p$, $G^T_p$, $C$, $fc^V_{sc}$, and $fc^T_{sc}$ are trained by updating $\Theta_G = \{\Theta^V_c, \Theta^T_c, \Theta^V_p, \Theta^T_p, \Theta_C, \Theta^V_{sc}, \Theta^T_{sc}\}$ while $\Theta_D$ is fixed:

$$\Theta_G = \arg\min_{\Theta_G} \mathcal{L}^g_{cal} + \alpha(\mathcal{L}_{ic} + \mathcal{L}_{sc}). \tag{15}$$

2) Hash Learning: For hash learning optimization, we first optimize the image hash code generation by training $G^V_c$ and $f^V_h$ while the parameters of $G^T_c$ and $f^T_h$ are fixed:

$$\Theta^V_c, \Theta^V_h = \arg\min_{\Theta^V_c, \Theta^V_h} \mathcal{L}_{hash}. \tag{16}$$

After that, we optimize the parameters of $G^T_c$ and $f^T_h$ for text hash code generation while the parameters of $G^V_c$ and $f^V_h$ are fixed:

$$\Theta^T_c, \Theta^T_h = \arg\min_{\Theta^T_c, \Theta^T_h} \mathcal{L}_{hash}. \tag{17}$$

3) Out-of-Sample Extension: Any new instance that is not in the training data can be encoded into a hash code as long as one of its modalities (image or text) is observed. For example, given an instance $q^V$ from the image modality, we can generate its hash code by forward propagating it through the image network:

$$\mathbf{b}^V_q = \mathrm{sign}\big(f^V_h(G^V_c(q^V))\big). \tag{18}$$

Similarly, given an instance $q^T$ from the text modality, we can generate its hash code through the text network:

$$\mathbf{b}^T_q = \mathrm{sign}\big(f^T_h(G^T_c(q^T))\big). \tag{19}$$

In this way, our CPAH can be employed for cross-modal retrieval, where the query data and the retrieved data come from different modalities.
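A retrieval-time sketch of Eqs. (18)-(19) follows (illustrative only): the trained networks are stood in by simple callables with random weights, a query is encoded with a sign function, and database items of the other modality are ranked by Hamming distance using Eq. (1).

```python
import numpy as np

rng = np.random.default_rng(6)
D_feat, K, N_db = 8, 16, 100

# Stand-ins for the trained image branch G_c^V (common representation) and hash layer f_h^V.
W_g = rng.standard_normal((D_feat, D_feat)) * 0.1
W_h = rng.standard_normal((K, D_feat)) * 0.1
g_c = lambda x: np.tanh(W_g @ x)
f_h = lambda r: W_h @ r

def encode_query(q):
    """Eq. (18)/(19): binary code of an out-of-sample query."""
    code = np.sign(f_h(g_c(q)))
    code[code == 0] = 1
    return code

db_codes = np.where(rng.standard_normal((N_db, K)) >= 0, 1, -1)  # codes of the other modality
query = rng.standard_normal(D_feat)
b_q = encode_query(query)
hamming = 0.5 * (K - db_codes @ b_q)                              # Eq. (1) applied row-wise
ranking = np.argsort(hamming)                                     # nearest items first
print(ranking[:5])
```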

4) Multiple Modalities Extension: By extending the two networks to multiple networks, the proposed CPAH can be extended to multiple modalities. The extension of CPAH in Eq. (13) from the bimodal to the multi-modal case is quite straightforward:

$$\min \mathcal{L}_{multi} = \sum_{i,j=1, i \neq j}^{M} \mathcal{L}^{ij}_{mtal} + \sum_{i,j=1, i \neq j}^{M} \mathcal{L}^{ij}_p + \sum_{i=1}^{M} \beta \| \mathbf{b}^i - h^i \|_F^2 \quad \text{s.t. } \mathbf{b} = \mathbf{b}^1 = \cdots = \mathbf{b}^M \in \{-1,+1\}^{K \times N}, \tag{20}$$

where $M$ is the number of modalities, $\mathcal{L}^{ij}_{mtal}$ is the multi-task adversarial learning loss between the $i$-th and $j$-th modalities, $\mathcal{L}^{ij}_p$ is the pairwise loss between the $i$-th and $j$-th modalities, $\mathbf{b}^i$ is the hash code of the $i$-th modality, and $h^i$ denotes the hash-layer output of the $i$-th network. It is straightforward to adjust the iterative algorithm presented above to solve this new optimization problem.

TABLE I

CONFIGURATION OF THE PROPOSED ARCHITECTURE

IV. EXPERIMENTS AND DISCUSSIONS

A. Datasets and Settings

1) Datasets: We evaluate our proposed method on three popular benchmark cross-modal datasets: MIRFLICKR-25K [44], IAPRTC-12 [45], and NUS-WIDE [46].

a) MIRFLICKR-25K: This dataset consists of 25,000 images collected from the Flickr website. Each image is associated with several textual tags; hence, each instance is an image-text pair. In total, we select 20,961 instances in our experiment. The text for each instance is represented as a 2029-dimensional bag-of-words vector.


Each instance is manually annotated with at least one of the 24 unique labels.

b) IAPRTC-12: This dataset consists of 20,000 image-text pairs covering various semantics such as landscape, action, and people categories. We use 18,584 instances in our experiment. The text for each instance is represented as a 4027-dimensional bag-of-words vector. The image-text pairs belong to the 22 frequent labels selected from the 275 semantic concepts.

c) NUS-WIDE: This dataset contains 269,648 web images, where images are associated with textual tags and each pair is annotated with multiple labels among 81 semantic concepts. We select 188,321 image-text pairs that belong to the 21 most frequent concepts. The text for each instance is represented as a 1000-dimensional bag-of-words vector.

We follow a dataset split similar to [20] to construct the test (query) sets, retrieval sets, and training sets. For the MIRFLICKR-25K and IAPRTC-12 datasets, 2,000 instances are randomly sampled as the test set, and the remaining instances form the retrieval set. Moreover, 10,000 instances are sampled from the retrieval set as the training set for both datasets. For the NUS-WIDE dataset, we take 2,100 instances as the test set and the rest as the retrieval set. In addition, we sample 10,500 instances from the retrieval set as the training set.

2) Implementation Details: We implement our CPAH based on the open-source deep learning toolbox TensorFlow. All experiments are run on a server with one NVIDIA TITAN X GPU. In the training procedure, both multi-task adversarial learning and hash learning are optimized in an alternating way using the adaptive moment estimation (Adam) optimizer. The batch size is 64, and the total number of epochs is 80. The initial learning rate is set to 0.0001 and decreases to one-fifth of its value every 30 epochs. In the testing procedure, we only use the modality-common representations of the image and text modalities to construct the binary codes.

For the image modality, the input images are first resized to 224 × 224 × 3. We use two CNN architectures as the extractor $E^V$ to extract image representations: CNN-F [43] and VGG16 [47], both of which consist of five convolutional blocks and three fully-connected layers. The output of their fc7 layer serves as the image representation. For the text modality, the text representation extractor $E^T$ is constructed from two 1 × 1 convolutional layers. After extracting the image and text representations, we design two consistency refined modules, i.e., $CR^V$ and $CR^T$, each consisting of two different fully-connected layers with 512 hidden nodes. For the multi-task adversarial learning module, we build a discriminator $D$, an information classifier $C$, and two semantic classifiers $fc^V_{sc}$ and $fc^T_{sc}$, all of which consist of fully-connected layers. For hash learning, two hash layers $fc^V_h$ and $fc^T_h$ are designed for hash code generation, each a fully-connected layer with $K$ hidden nodes. Table I shows the detailed configuration of the different modules in the proposed CPAH. Specifically, “f” denotes convolution filters, “st.” is the convolution stride, “pad” is the number of pixels added to the input, “LRN” denotes Local Response Normalization, and “pool” denotes the down-sampling factor.
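As a concrete reading of the training schedule described above (batch size 64, 80 epochs, Adam with an initial learning rate of 1e-4 decayed to one-fifth every 30 epochs), a framework-agnostic helper might look as follows; this is a sketch of the stated schedule, not the authors' TensorFlow code.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.2, step=30):
    """Step schedule: multiply the learning rate by `decay` every `step` epochs."""
    return base_lr * (decay ** (epoch // step))

# 80-epoch schedule from the implementation details above.
for epoch in (0, 29, 30, 59, 60, 79):
    print(epoch, learning_rate(epoch))   # 1e-4, 1e-4, 2e-5, 2e-5, 4e-6, 4e-6
```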

TABLE II

THE MAP SCORES OF TWO RETRIEVAL TASKS ON THE MIRFLICKR-25K DATASET WITH DIFFERENT HASH CODE LENGTHS. THE BASELINES ARE BASED ON CNN-F FEATURES AND THE BEST ACCURACY IS SHOWN IN BOLDFACE

B. Evaluation

In this part, we perform two cross-modal retrieval tasks: image-to-text retrieval and text-to-image retrieval, which search texts by a query image and search images by a query text, respectively. To evaluate the performance of our method, we adopt three evaluation criteria: mean average precision (MAP), the topN-precision curve (TopN curve), and the precision-recall curve (PR curve).

To calculate the MAP, we first evaluate the average precision (AP). Given a query instance, we obtain a ranking list of R retrieved results, and the value of AP is defined as:

$$AP = \frac{1}{N} \sum_{r=1}^{R} p(r)\,\delta(r), \tag{21}$$

where N is the number of relevant instances in the retrieved set, $p(r)$ denotes the precision of the top r retrieved instances, and $\delta(r) = 1$ if the r-th retrieved result is relevant to the query instance, otherwise $\delta(r) = 0$. Here, we consider two samples similar as long as they share at least one label. MAP is the mean of the AP values over all queries; it measures the discriminative learning ability of different cross-modal retrieval methods, where a higher MAP indicates better retrieval performance. The topN-precision curve reflects the change in precision with the number of retrieved instances, and the precision-recall curve reflects the precision at different recall levels, obtained by varying the Hamming radius of the retrieved instances over a certain range.
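The MAP computation of Eq. (21) can be written compactly as below (an illustrative sketch; `relevance` is a hypothetical 0/1 array ordered by retrieval rank for each query):

```python
import numpy as np

def average_precision(relevance):
    """Eq. (21): relevance is a 0/1 vector over the top-R ranked results of one query."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    ranks = np.arange(1, len(relevance) + 1)
    precision_at_r = np.cumsum(relevance) / ranks        # p(r)
    return float((precision_at_r * relevance).sum() / relevance.sum())

def mean_average_precision(relevance_lists):
    return float(np.mean([average_precision(r) for r in relevance_lists]))

# Toy example: two queries with ranked relevance judgements.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 1]]))
```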

C. Comparison With State-of-the-art Methods

Our CPAH is compared with several state-of-the-art algorithms: CVH [31], STMH [48], CMSSH [33], SCM [34], LSSH [32], SePH [49], and DCMH [20] on all three datasets,


TABLE III

THE MAP SCORES OF TWO RETRIEVAL TASKS ON THE MIRFLICKR-25K DATASET WITH DIFFERENT HASH CODE LENGTHS. THE BASELINES ARE BASED ON VGG16 FEATURES AND THE BEST ACCURACY IS SHOWN IN BOLDFACE

TABLE IV

THE MAP SCORES OF TWO RETRIEVAL TASKS ON THE IAPRTC-12 DATASET WITH DIFFERENT HASH CODE LENGTHS. THE BASELINES ARE BASED ON CNN-F FEATURES AND THE BEST ACCURACY IS SHOWN IN BOLDFACE

and the hash code lengths are 16 bits, 32 bits, and 64 bits. In order to make fair comparisons, we utilize deep image features extracted from pre-trained CNN-F and VGG16 for all the shallow baseline methods mentioned above.

Table II and Table III, Table IV and Table V, and Table VI and Table VII present the MAP values of our method and the other methods with CNN-F features and VGG16 features for the text-query-image and image-query-text tasks on the MIRFLICKR-25K, IAPRTC-12, and NUS-WIDE datasets, respectively. From these results, we find that our proposed CPAH is superior to all comparison methods on different

TABLE V

THE MAP SCORES OF TWO RETRIEVAL TASKS ON THE IAPRTC-12 DATASET WITH DIFFERENT HASH CODE LENGTHS. THE BASELINES ARE BASED ON VGG16 FEATURES AND THE BEST ACCURACY IS SHOWN IN BOLDFACE

TABLE VI

THE MAP SCORES OF TWO RETRIEVAL TASKS ON THE NUS-WIDE DATASET WITH DIFFERENT HASH CODE LENGTHS. THE BASELINES ARE BASED ON CNN-F FEATURES AND THE BEST ACCURACY IS SHOWN IN BOLDFACE

retrieval tasks. For example, on the IAPRTC-12 dataset, the MAP of our method is, on average, about ten percentage points higher than that of the second best algorithm. Compared with other methods, our CPAH can effectively capture semantic consistency information through the consistency refined module and the multi-task adversarial learning module, which significantly boosts the correlation among different modalities.

Fig. 2 and Fig. 3 illustrate the precision-recall curves and topN-precision curves on the three datasets with 64-bit hash codes. It can be seen that the curves of CPAH are always higher than those of all other methods. These results are consistent


Fig. 2. Precision-recall curves of different methods for the image query text and text query image tasks on (a) and (b) MIRFLICKR-25K, (e) and (f) IAPRTC-12, and (i) and (j) NUS-WIDE datasets based on the CNN-F feature, and (c) and (d) MIRFLICKR-25K, (g) and (h) IAPRTC-12, and (k) and (l) NUS-WIDE datasets based on the VGG16 feature (the code length is 64).

TABLE VII

THE MAP SCORES OF TWO RETRIEVAL TASKS ON THE NUS-WIDE DATASET WITH DIFFERENT HASH CODE LENGTHS. THE BASELINES ARE BASED ON VGG16 FEATURES AND THE BEST ACCURACY IS SHOWN IN BOLDFACE

with the MAP evaluation. In short, the proposed method outperforms all of the compared methods.

We also make a comparison between our CPAH and other state-of-the-art methods: DVSH [35], CMDVH [19],

TABLE VIII

THE TOP-500 MAP RESULTS ON IAPRTC-12 DATASET

UNDER THE EXPERIMENTAL SETTINGS OF [19]

SSAH [22], and ADAH [30]. Since the code of these methods is not publicly available, we compare with them under the same experimental settings. For CMDVH and SSAH, we follow the same experimental settings as in [19] and directly cite the results from [22] for a fair comparison. The top-500 MAP results on IAPRTC-12 are shown in Table VIII, and the top-100 MAP results on NUS-WIDE are shown in Table IX. It can be seen that our proposed method outperforms CMDVH as well as SSAH. In addition, SSAH is also an adversarial learning based method. These results demonstrate that our method has a more powerful capability to explore semantic correlation and consistency.

In order to compare with DVSH and ADAH, we implement our method under the same experimental settings as


Fig. 3. TopN-precision curves of different methods for the image query text and text query image tasks on (a) and (b) MIRFLICKR-25K, (e) and (f) IAPRTC-12, and (i) and (j) NUS-WIDE datasets based on the CNN-F feature, and (c) and (d) MIRFLICKR-25K, (g) and (h) IAPRTC-12, and (k) and (l) NUS-WIDE datasets based on the VGG16 feature (the code length is 64).

TABLE IX

THE TOP-100 MAP RESULTS ON NUS-WIDE DATASET UNDER THE EXPERIMENTAL SETTINGS OF [19]

in [35] and [30]. The top-500 MAP results of DVSH, DCMH, ADAH, and our CPAH on IAPRTC-12 are illustrated in Table X. Notably, DVSH adopts an LSTM recurrent neural network to extract sentence features for the text representation, while DCMH, ADAH, and our method only use bag-of-words vectors. The results in Table X show that our method achieves better performance than DVSH, DCMH, and ADAH in all cases, even though we use one-hot bag-of-words vectors.

D. Hyper-Parameter Sensitivity Analysis

We studied the influence of the hyper-parameters α and β. Fig. 4 shows the MAP results on the IAPRTC-12 dataset with different values of α and β, where the code length is 16 bits.

TABLE X

THE TOP-500 MAP RESULTS ON IAPRTC-12 DATASET UNDER THE EXPERIMENTAL SETTINGS OF [35] AND [30]

TABLE XI

ABLATION STUDY OF CONSISTENCY REFINED MODULE AND MULTI-TASK

ADVERSARIAL LEARNING MODULE ON MIRFLICKR-25K DATASET

When one parameter is analyzed, the other parameter is fixed. According to the results in Fig. 4, we finally choose α = 0.1 and β = 1 for our hyper-parameters.


Fig. 4. The influence of hyper-parameters. (a) The MAP on the IAPRTC-12 dataset with 16 bits under different values of α. (b) The MAP on the IAPRTC-12 dataset with 16 bits under different values of β.

E. Ablation Study

In order to justify the effectiveness of the proposed consistency refined module and multi-task adversarial learning module, we conduct an ablation study on the MIRFLICKR-25K dataset. The MAP results are shown in Table XI. "No" denotes the original networks without the consistency refined module and the multi-task adversarial learning module. "MA" means we only use the multi-task adversarial learning module; it achieves about a 3%-4% improvement in MAP compared to "No", which proves the effectiveness of the proposed multi-task adversarial learning module. "MA+CR" means we use both the multi-task adversarial learning module and the consistency refined module; it brings about a further 1% improvement compared to "MA", which proves the effectiveness of the proposed consistency refined module.

V. CONCLUSION

In this paper, we present a novel deep hashing method, namely multi-task consistency-preserving adversarial hashing (CPAH), for cross-modal retrieval. Compact hash codes for image and text are generated in an end-to-end deep learning architecture. Specifically, a consistency refined module (CR) is designed to divide the representation of each modality into uncorrelated parts: a modality-common and a modality-private representation. Furthermore, we design a multi-task adversarial learning module (MA), which not only makes the modality-common representations of different modalities close to each other but also preserves the semantic consistency effectively. Finally, we employ each modality-common representation to generate compact and discriminative hash codes. Extensive experimental results on three datasets, MIRFLICKR-25K, IAPRTC-12, and NUS-WIDE, show that the proposed CPAH yields state-of-the-art performance in cross-modal retrieval tasks.

REFERENCES

[1] D. Tao, Y. Guo, M. Song, Y. Li, Z. Yu, and Y. Y. Tang, “Person re-identification by dual–regularized KISS metric learning,” IEEE Trans.Image Process., vol. 25, no. 6, pp. 2726–2738, Jun. 2016.

[2] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, “Bit–scalabledeep hashing with regularized similarity learning for image retrieval andperson re–identification,” IEEE Trans. Image Process., vol. 24, no. 12,pp. 4766–4779, Dec. 2015.

[3] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalizedmultiview analysis: A discriminative latent space,” in Proc. CVPR, 2012,pp. 2160–2167.

[4] K. Wang, R. He, L. Wang, W. Wang, and T. Tan, “Joint feature selectionand subspace learning for cross–modal retrieval,” IEEE Trans. PatternAnal. Mach. Intell., vol. 38, no. 10, pp. 2010–2023, Oct. 2016.

[5] C. Deng, X. Tang, J. Yan, W. Liu, and X. Gao, “Discriminative dictio-nary learning with common label alignment for cross–modal retrieval,”IEEE Trans. Multimedia, vol. 18, no. 2, pp. 208–218, Feb. 2016.

[6] J. Tang, K. Wang, and L. Shao, “Supervised matrix factorization hashingfor cross–modal retrieval,” IEEE Trans. Image Process., vol. 25, no. 7,pp. 3157–3166, Jul. 2016.

[7] X. Xu, F. Shen, Y. Yang, H. T. Shen, and X. Li, “Learning discriminativebinary codes for large-scale cross-modal retrieval,” IEEE Trans. ImageProcess., vol. 26, no. 5, pp. 2494–2507, May 2017.

[8] X. Shi, F. Xing, K. Xu, M. Sapkota, and L. Yang, “Asymmetric discretegraph hashing,” in Proc. AAAI, 2017, pp. 2541–2547.

[9] X. Shi, M. Sapkota, F. Xing, F. Liu, L. Cui, and L. Yang, “Pairwisebased deep ranking hashing for histopathology image classification andretrieval,” Pattern Recognit., vol. 81, pp. 14–22, Sep. 2018.

[10] M. Sapkota, X. Shi, F. Xing, and L. Yang, “Deep convolutional hashingfor low–dimensional binary embedding of histopathological images,”IEEE J. Biomed. Health Inform., vol. 23, no. 2, pp. 805–816, Mar. 2019.

[11] W. Liu, C. Mu, S. Kumar, and S.-F. Chang, “Discrete graph hashing,”in Proc. NeurIPS, 2014, pp. 3419–3427.

[12] F. Feng, X. Wang, and R. Li, “Cross-modal retrieval with correspondenceautoencoder,” in Proc. ACM MM, 2014, pp. 7–16.

[13] H. Liu, R. Ji, Y. Wu, F. Huang, and B. Zhang, “Cross-modality binarycode learning via fusion similarity hashing,” in Proc. CVPR, 2017,pp. 6345–6353.

[14] X. Liu, C. Deng, B. Lang, D. Tao, and X. Li, “Query–adaptive reciprocalhash tables for nearest neighbor search,” IEEE Trans. Image Process.,vol. 25, no. 2, pp. 907–919, Feb. 2016.

[15] X. Liu, Z. Li, C. Deng, and D. Tao, “Distributed adaptive binaryquantization for fast nearest neighbor search,” IEEE Trans. ImageProcess., vol. 26, no. 11, pp. 5324–5336, Nov. 2017.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classificationwith deep convolutional neural networks,” in Proc. NeurIPS, 2012,pp. 1097–1105.

[17] Y. Bengio, A. Courville, and P. Vincent, “Representation learning:A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell.,vol. 35, no. 8, pp. 1798–1828, Aug. 2013.

[18] C. Deng, Y. Xue, X. Liu, C. Li, and D. Tao, “Active transfer learningnetwork: A unified deep joint spectral–spatial feature learning model forhyperspectral image classification,” IEEE Trans. Geosci. Remote Sens.,vol. 57, no. 3, pp. 1741–1754, Nov. 2019.

[19] V. Erin Liong, J. Lu, Y.-P. Tan, and J. Zhou, “Cross-modal deepvariational hashing,” in Proc. ICCV, 2017, pp. 4077–4085.

[20] Q. Jiang and W. Li, “Deep cross-modal hashing,” in Proc. CVPR, 2017.

[21] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao, “Pairwise relationship guided deep hashing for cross-modal retrieval,” in Proc. AAAI, 2017, pp. 1618–1625.

[22] C. Li, C. Deng, N. Li, W. Liu, X. Gao, and D. Tao, “Self-supervisedadversarial hashing networks for cross-modal retrieval,” in Proc. CVPR,2018, pp. 4242–4251.

[23] Y. Cao, B. Liu, M. Long, and J. Wang, “Cross-modal Hamminghashing,” in Proc. ECCV, 2018, pp. 202–218.

[24] C. Deng, Z. Chen, X. Liu, X. Gao, and D. Tao, “Triplet–based deephashing network for cross–modal retrieval,” IEEE Trans. Image Process.,vol. 27, no. 8, pp. 3893–3903, Aug. 2018.

[25] C. Deng, E. Yang, T. Liu, J. Li, W. Liu, and D. Tao, “Unsupervisedsemantic–preserving adversarial hashing for image search,” IEEE Trans.Image Process., vol. 28, no. 8, pp. 4032–4044, Aug. 2019.

[26] E. Yang, T. Liu, C. Deng, and D. Tao, “Adversarial examples forHamming space search,” IEEE Trans. Cybern., to be published.

[27] E. Yang, C. Deng, T. Liu, W. Liu, and D. Tao, “Semantic structure-basedunsupervised deep hashing,” in Proc. IJCAI, 2018, pp. 1064–1070.

[28] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan,“Domain separation networks,” in Proc. NeurIPS, 2016, pp. 343–351.

[29] E. Yang, C. Deng, C. Li, W. Liu, J. Li, and D. Tao, “Shared predictivecross–modal deep quantization,” IEEE Trans. Neural Netw. Learn. Syst.,vol. 29, no. 11, pp. 5292–5303, Nov. 2018.

[30] X. Zhang, H. Lai, and J. Feng, “Attention-aware deep adversarial hashingfor cross-modal retrieval,” in Proc. ECCV, 2018, pp. 591–606.

[31] S. Kumar and R. Udupa, “Learning hash functions for cross-viewsimilarity search,” in Proc. IJCAI, 2011, pp. 1360–1365.


[32] J. Zhou, G. Ding, and Y. Guo, “Latent semantic sparse hashing for cross-modal similarity search,” in Proc. ACM SIGIR, 2014, pp. 415–424.

[33] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, “Datafusion through cross-modality metric learning using similarity-sensitivehashing,” in Proc. CVPR, 2010, pp. 3594–3601.

[34] D. Zhang and W. J. Li, “Large-scale supervised multimodal hash-ing with semantic correlation maximization,” in Proc. AAAI, 2014,pp. 2177–2183.

[35] Y. Cao, M. Long, J. Wang, Q. Yang, and P. S. Yu, “Deep visual-semantic hashing for cross-modal retrieval,” in Proc. ACM SIGKDD,2016, pp. 1445–1454.

[36] I. Goodfellow et al., “Generative adversarial nets,” in Proc. NeurIPS,2014, pp. 2672–2680.

[37] M. Mirza and S. Osindero, “Conditional generative adversarialnets,” 2014, arXiv:1411.1784. [Online]. Available: https://arxiv.org/abs/1411.1784

[38] A. Radford, L. Metz, and S. Chintala, “Unsupervised representationlearning with deep convolutional generative adversarial networks,” 2015,arXiv:1511.06434. [Online]. Available: https://arxiv.org/abs/1511.06434

[39] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata, “Feature generatingnetworks for zero-shot learning,” in Proc. CVPR, 2018, pp. 5542–5551.

[40] D. Xie, C. Deng, H. Wang, C. Li, and D. Tao, “Semantic adversarialnetwork with multi-scale pyramid attention for video classification,” inProc. AAAI, 2019, pp. 9030–9037.

[41] Y. Cao, B. Liu, M. Long, and J. Wang, “HashGAN: Deep learning tohash with pair conditional Wasserstein GAN,” in Proc. CVPR, 2018,pp. 1287–1296.

[42] B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen, “Adversarialcross-modal retrieval,” in Proc. ACM MM, 2017, pp. 154–162.

[43] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return ofthe devil in the details: Delving deep into convolutional nets,” 2014,arXiv:1405.3531. [Online]. Available: https://arxiv.org/abs/1405.3531

[44] M. J. Huiskes and M. S. Lew, “The MIR flickr retrieval evaluation,” inProc. ACM CIVR, 2008, pp. 39–43.

[45] H. J. Escalante et al., “The segmented and annotated IAPR TC-12 benchmark,” Comput. Vis. Image Understand., vol. 114, no. 4,pp. 419–428, 2010.

[46] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: A real-world Web image database from National University ofSingapore,” in Proc. ACM CIVR, 2009, p. 48.

[47] K. Simonyan and A. Zisserman, “Very deep convolutional networksfor large-scale image recognition,” 2014, arXiv:1409.1556. [Online].Available: https://arxiv.org/abs/1409.1556

[48] D. Wang, X. Gao, X. Wang, and L. He, “Semantic topic multimodalhashing for cross-media retrieval,” in Proc. IJCAI, 2015, pp. 3890–3896.

[49] Z. Lin, G. Ding, M. Hu, and J. Wang, “Semantics-preserving hashingfor cross-view retrieval,” in Proc. CVPR, 2015, pp. 3864–3872.

De Xie received the B.E. degree from the Xi’an University of Architecture and Technology, Xi’an, China, in 2015. He is currently pursuing the Ph.D. degree with the School of Electronic Engineering, Xidian University. His research interests focus on computer vision, natural language processing, and multi-modal analysis.

Cheng Deng (Member, IEEE) received the B.E., M.S., and Ph.D. degrees in signal and information processing from Xidian University, Xi’an, China. He is currently a Full Professor with the School of Electronic Engineering, Xidian University. He has authored or coauthored more than 80 scientific articles at top venues, including IEEE TNNLS, TIP, TCYB, TMM, TSMC, ICCV, CVPR, ICML, NIPS, IJCAI, and AAAI. His research interests include computer vision, pattern recognition, and information hiding.

Chao Li received the B.E. degree in electronic and information engineering from the Inner Mongolia University of Science and Technology, China, in 2014. He is currently pursuing the Ph.D. degree with the School of Electronic Engineering, Xidian University. His main research interests include computer vision and machine learning.

Xianglong Liu received the B.S. and Ph.D. degrees in computer science from Beihang University, Beijing, in 2008 and 2014, respectively. From 2011 to 2012, he visited the Digital Video and Multimedia Laboratory, Columbia University, as a joint Ph.D. student. He is currently an Assistant Professor with Beihang University. His research interests include machine learning, computer vision, and multimedia information retrieval.

Dacheng Tao (Fellow, IEEE) is currently a Professor of computer science and an ARC Laureate Fellow with the School of Computer Science, Faculty of Engineering and Information Technologies, and the Inaugural Director of the UBTECH Sydney Artificial Intelligence Centre, The University of Sydney. He mainly applies statistics and mathematics to artificial intelligence and data science. His research results have been expounded in one monograph and more than 200 publications in prestigious journals and at prominent conferences, such as IEEE T-PAMI, T-IP, T-NNLS, T-CYB, IJCV, JMLR, NIPS, ICML, CVPR, ICCV, ECCV, ICDM, and ACM SIGKDD, with several best paper awards, such as the Best Theory/Algorithm Paper Runner Up Award at IEEE ICDM 07, the Best Student Paper Award at IEEE ICDM 13, the 2014 ICDM 10-Year Highest-Impact Paper Award, the 2017 IEEE Signal Processing Society Best Paper Award, and the Distinguished Paper Award at the 2018 IJCAI. He also received the 2015 Australian Scopus-Eureka Prize and the 2018 IEEE ICDM Research Contributions Award. He is a Fellow of the Australian Academy of Science, AAAS, IAPR, OSA, and SPIE.
