
Distilled Person Re-identification: Towards a More Scalable System

Ancong Wu1, Wei-Shi Zheng2,3*, Xiaowei Guo5, and Jian-Huang Lai2,4

1School of Electronics and Information Technology, Sun Yat-sen University, China
2School of Data and Computer Science, Sun Yat-sen University, China
3Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
4Guangdong Province Key Laboratory of Information Security, China
5YouTu Lab, Tencent
[email protected], [email protected], [email protected], [email protected]

Abstract

Person re-identification (Re-ID), for matching pedestrians across non-overlapping camera views, has made great progress in supervised learning with abundant labelled data. However, the scalability problem is the bottleneck for applications in large-scale systems. We consider the scalability problem of Re-ID from three aspects: (1) low labelling cost by reducing the label amount, (2) low extension cost by reusing existing knowledge, and (3) low testing computation cost by using lightweight models. These requirements render scalable Re-ID a challenging problem. To solve these problems in a unified system, we propose a Multi-teacher Adaptive Similarity Distillation Framework, which requires only a few labelled identities of the target domain to transfer knowledge from multiple teacher models to a user-specified lightweight student model, without accessing source domain data. We propose the Log-Euclidean Similarity Distillation Loss for Re-ID and further integrate the Adaptive Knowledge Aggregator to select effective teacher models and transfer target-adaptive knowledge. Extensive evaluations show that our method extends with high scalability and achieves performance comparable to state-of-the-art unsupervised and semi-supervised Re-ID methods.

1. Introduction

With the development of surveillance systems, person re-identification (Re-ID) has drawn much attention in recent years. Most research focuses on supervised learning [25, 67, 31, 28, 1, 47] and has made great progress. However, system extension is still a significant obstacle for applying Re-ID in large-scale surveillance systems because of the scalability problem, which remains under-explored. Some previous works attempt to improve scalability from different aspects, such as unsupervised and transfer learning [42, 24, 62, 52, 51, 11] for reducing the label amount and fast retrieval [56, 73] for large-scale applications.

∗Corresponding author


Figure 1. A scalable adaptation Re-ID system. We propose to store the knowledge in teacher models trained in existing scenes. When extending to a new scene, we can flexibly transfer the knowledge from teacher models to a user-specified lightweight student model by using unlabelled data and little labelled data of the new scene, without using source domain data, which may not be accessible.

To build a more scalable Re-ID system, it is expected that these scalability problems can be addressed in a unified system.

We consider the requirements and challenges of the scalability problem of Re-ID mainly from three aspects:

(1) Low labelling cost. Existing supervised Re-ID methods require abundant labelled identities, which is unrealistic for a large-scale system. A scalable Re-ID system should be able to learn from unlabelled data and limited labelled data, which provide very limited information for learning.

(2) Low extension cost. When extending to a new scene, most existing Re-ID methods apply transfer learning, which requires auxiliary source domain data for pretraining or joint training. In some cases, source domain data of other scenes may not be accessible because of privacy or transfer problems. Even if source domain data is available, joint training increases computation costs. Moreover, pretrained models may not be applicable due to different user-specified requirements on model architecture and capacity. Thus, a scalable Re-ID system should be able to extend to a new scene flexibly at low cost. Knowledge transfer without accessing source domain data is a challenge.

(3) Low testing computation cost.


Existing state-of-the-art methods for Re-ID are based on large neural network models, e.g., ResNet-50 [20], which cannot meet the requirements of the camera hardware development trend of front-end processing on chips. Thus, a scalable Re-ID system should be able to learn lightweight models.

To solve the above problems, we propose a scalable adaptation Re-ID system based on knowledge reuse, as shown in Fig. 1. The system can extend with unlabelled data and only a few labelled identities in the target domain, without using source domain data. Instead of storing raw data, we store the knowledge of Re-ID for an existing scene i in a teacher model H_Ti. When extending to a new scene, knowledge in a teacher model pool {H_Ti} (i = 1, ..., M) is aggregated and transferred to a new user-specified lightweight student model H_S for the target domain. In this system, knowledge transfer and knowledge aggregation are the two key problems, and we address them as follows.

Knowledge transfer in our system is challenging, because limited labelled target data and the absence of source data provide little information for learning, and the knowledge needs to be compressed into a lightweight model. To solve these problems, we imitate and distill the knowledge in the teacher models, i.e., knowledge distillation [21]. For Re-ID, we exploit the knowledge embedded in similarity and propose the Log-Euclidean Similarity Distillation Loss for imitating the sample pairwise similarity matrix of the teacher. Most knowledge distillation methods [21, 37, 61] are designed for closed-set classification and convey knowledge by soft labels. They are not suitable for Re-ID, because Re-ID is an open-set identification problem in which the identities are non-overlapping between training and testing.

Furthermore, to effectively aggregate knowledge from multiple teachers, we propose the Adaptive Knowledge Aggregator, which dynamically adjusts the contributions of multiple teachers in the distillation loss in order to select effective teachers and aggregate effective knowledge for the target domain. Only a few labelled identities (e.g., 10) are needed as validation data for computing the empirical risk that guides knowledge aggregation. We further integrate the Log-Euclidean Similarity Distillation Loss and the Adaptive Knowledge Aggregator into a Multi-teacher Adaptive Similarity Distillation Framework, as shown in Fig. 2.

In summary, the contributions of this paper are: (1) We propose the Log-Euclidean Similarity Distillation Loss for knowledge distillation for Re-ID, which is an open-set identification problem. (2) We propose the Adaptive Knowledge Aggregator for aggregating effective knowledge from multiple teacher models to learn a lightweight student model. (3) We further integrate them into a Multi-teacher Adaptive Similarity Distillation Framework for scalable person re-identification, which simultaneously reduces labelling cost, extension cost and testing computation cost.

2. Related Work

Supervised Person Re-identification. Person re-identification has witnessed fast-growing development in recent years, from feature design [19, 15, 32, 31, 35, 69, 58] to distance metric learning [54, 19, 43, 25, 67, 36, 41, 30, 58, 33, 38, 31, 9, 68, 60, 63, 69, 29, 52, 6] and end-to-end deep learning [28, 1, 55, 57, 49, 22, 64, 65, 70, 47]. Most existing works rely on abundant labelled data. Although high performance can be achieved by deep models, the heavy labelling cost hinders the scalability of these methods.

Scalable Person Re-identification. Recently, scalable person re-identification has drawn more attention for reducing the costs of system extension. Unsupervised learning [42, 24, 62, 51, 14, 53, 13, 7, 27, 71], transfer learning [72, 53, 13, 11], small sample learning [63, 2] and active learning [34, 50, 45] reduce labelling cost by minimizing the requirement for labelled data or by selecting specific data for labelling. Fast adaptation [5, 39] reduces computation cost in training. Fast retrieval by lightweight models [56] and binary representations [8, 73] reduces computation cost in testing. These methods each address only one of the problems of labelling cost, extension cost or testing computation cost, while our method addresses these problems simultaneously in a unified framework.

Knowledge Transfer/Distillation. Knowledge transfer/distillation [21] transfers knowledge from a large teacher model to a smaller student model by imitation. The vast majority of knowledge distillation methods, such as [21, 37, 61, 16], are designed for closed-set classification problems and use soft labels of the teacher model to guide learning of the student model. However, person re-identification is an open-set identification problem, in which the identities in training and testing are non-overlapping, and thus soft-label-based distillation methods are not well suited to Re-ID. Some methods also consider using information other than soft labels as knowledge. FitNets [44] and FSP [59] exploit feature maps and PKT [40] exploits the probability distribution of data, which are not directly related to measuring similarity for matching in Re-ID. To more effectively represent and convey knowledge, we distill the knowledge embedded in sample similarity by imitating the teacher similarity matrix. As for distilling from multiple teacher models as in our proposed method, [16, 61] exploit the ensemble of multiple teachers, but they are designed for closed-set classification and the contributions of different teachers cannot be adaptively adjusted as in our method. Semi-supervised teacher-student frameworks [48, 17] are for closed-set classification and cannot solve our problem.

Hypothesis transfer learning (HTL) [26] studies learning from source models without source data. Existing HTL methods [3, 12] are for closed-set domain adaptation and cannot solve the open-set identification problem of Re-ID.


3. Similarity Knowledge Distillation

As designed in the scalable adaptation Re-ID system (Fig. 1) in Section 1, to learn a model for the target domain with only a few labelled data, we can transfer knowledge from other domains. To transfer knowledge without accessing source domain data, we regard the model as a student, which learns to imitate the teacher models by knowledge distillation. Most knowledge distillation methods [21, 37, 16] use soft labels. However, conveying knowledge by identity soft labels is not well suited to Re-ID, because Re-ID is an open-set identification problem in which there is no overlap of identities between training and testing.

To overcome this problem, we exploit the knowledge embedded in similarity and make the student model imitate the pairwise similarities of the teacher model. For N unlabelled image samples {I_i} (i = 1, ..., N), let A denote the pairwise similarity matrix, where the element a_{i,j} in the i-th row and j-th column of A is the similarity between samples I_i and I_j determined by model H. Let H_S denote the student model to be learned and H_T denote the teacher model, which is fixed. The pairwise similarity matrices of the student model H_S and the teacher model H_T are denoted by A_S and A_T, respectively. To transfer knowledge from teacher to student, we minimize the distance between the student similarity matrix A_S and the teacher similarity matrix A_T as follows:

\min \; \mathrm{dist}(\mathbf{A}_S, \mathbf{A}_T), \qquad (1)

where dist(·) is a distance metric for similarity matrices. Note that A_T is fixed as the target for learning A_S.

3.1. Construction of Similarity Matrices

Student Similarity Matrix A_S. To obtain the pairwise similarity matrix A_S for the student model H_S, we use cosine similarity, which is commonly used in neural networks for Re-ID [47]. For a sample I_i, the student model H_S extracts a d-dimensional feature vector H_S(I_i; Θ_S), where Θ_S is the parameter of H_S. Let x_{S,i} ∈ R^d denote the non-negative normalized unit feature vector formulated by

\mathbf{x}_{S,i} = \mathrm{ReLU}(H_S(I_i;\Theta_S)) \,/\, \left\| \mathrm{ReLU}(H_S(I_i;\Theta_S)) \right\|, \qquad (2)

where ReLU(x) = max(0, x) is an activation function applied after the feature layer of H_S.

Let X_S = [x_{S,1}, x_{S,2}, ..., x_{S,N}] ∈ R^{d×N} denote the feature matrix for samples {I_i} (i = 1, ..., N). The student similarity matrix A_S is computed by

\mathbf{A}_S = \mathbf{X}_S^\top \mathbf{X}_S, \qquad (3)

where a_{S,i,j} = x_{S,i}^⊤ x_{S,j} in A_S is the cosine similarity between samples I_i and I_j.

Properties of Student Similarity Matrix. We derive the properties of the student similarity matrix A_S as follows. (1) The range of similarities in A_S is [0, 1], since the cosine similarity between non-negative unit feature vectors extracted by Eq. (2) is between 0 and 1. (2) A_S is a symmetric positive semi-definite matrix: in Eq. (3), X_S^⊤ X_S is symmetric positive semi-definite. In our case of mini-batch learning, the feature dimension d is larger than the batch size N. Generally, X_S ∈ R^{d×N} then has full column rank, i.e. rank(X_S) = N, so that A_S ∈ R^{N×N} is a symmetric positive definite (SPD) matrix in this case.

Teacher Similarity Matrix A_T. The teacher similarity matrix A_T, as the target of the student similarity matrix A_S, should satisfy the properties of the student similarity matrix.

Generally, when using a neural network as the teacher model, the teacher similarity matrix A_T can be computed in the same way as for the student model in Eq. (3), using the feature matrix X_T = [x_{T,1}, x_{T,2}, ..., x_{T,N}] extracted by the teacher model H_T.

In other cases, the teacher similarity matrix A_T may not satisfy the two properties of the student similarity matrix. For example, when a pairwise verification neural network is used as the teacher model without constraints, the similarity matrix may not be SPD. Some simple transformations can be applied to make the teacher similarity matrix valid. First, if the range of similarities in A_T is not [0, 1], it can be mapped to [0, 1] by normalization. Second, if A_T is not symmetric positive definite, we can project it onto the cone of positive semi-definite matrices as in [54]. More details are provided in the supplementary material due to space limitations.
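As a concrete illustration of Eqs. (2)-(3), the following is a minimal sketch assuming PyTorch (any framework with automatic differentiation would do). It stacks features row-wise as an N × d matrix instead of column-wise as in the text, which yields the same N × N similarity matrix; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(features: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine-similarity matrix of Eq. (3).

    features: (N, d) raw outputs of the student feature layer.
    Returns an (N, N) matrix with entries in [0, 1].
    """
    x = F.relu(features)              # non-negativity, Eq. (2)
    x = F.normalize(x, p=2, dim=1)    # unit l2 norm per sample
    return x @ x.t()                  # a_ij = x_i^T x_j

# Example: with batch size N (= 32) smaller than the feature dimension d (= 256),
# the resulting matrix is generically symmetric positive definite (Section 3.1).
feats = torch.randn(32, 256, requires_grad=True)
A_S = similarity_matrix(feats)
```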

3.2. Log-Euclidean Similarity Distillation

After constructing the similarity matrices of the student and the teacher, we aim to distill the knowledge by minimizing the distance between the similarity matrices in Eq. (1). As analyzed above, the student and teacher similarity matrices are symmetric positive definite (SPD) matrices, which intrinsically lie on a Riemannian manifold [4] instead of a vector space. Hence, when measuring the distance between A_S and A_T, we take this property into consideration and measure the distance in the log-Euclidean Riemannian framework [4] instead of using the Euclidean metric, as follows:

\mathrm{dist}(\mathbf{A}_S, \mathbf{A}_T) = \left\| \log(\mathbf{A}_S) - \log(\mathbf{A}_T) \right\|_F, \qquad (4)

where log(A) is the matrix logarithm of A. For any SPD matrix A, the logarithm is

\log(\mathbf{A}) = \mathbf{U}\,\mathrm{diag}(\log(\lambda_1), \log(\lambda_2), ..., \log(\lambda_N))\,\mathbf{U}^\top, \qquad (5)

where U is the orthonormal matrix of eigenvectors and λ_i are the eigenvalues, obtained from the eigendecomposition A = U diag(λ_1, λ_2, ..., λ_N) U^⊤.

We distill the knowledge embedded in the similarity from teacher to student by minimizing the log-Euclidean distance as follows:

\min_{\Theta_S} \; \mathcal{L}_T(\mathbf{X}_S) = \left\| \log(\mathbf{X}_S^\top \mathbf{X}_S) - \log(\mathbf{A}_T) \right\|_F^2, \qquad (6)

where Θ_S is the parameter of the student model H_S, X_S is the feature matrix extracted by H_S and processed by Eq. (2), X_S^⊤ X_S is A_S as computed in Eq. (3), and A_T is the fixed similarity matrix provided by the teacher model H_T. We call L_T the Log-Euclidean Similarity Distillation Loss for H_T. The effectiveness of the log-Euclidean metric is further validated in our experiments by comparison with the Euclidean metric.

Figure 2. Overview of the Multi-teacher Adaptive Similarity Distillation Framework. Unlabelled data and a few labelled identities of the target domain are required. The teacher models are fixed feature extractors for existing scenes in the system. The student model is a new user-specified lightweight model to be trained. The teacher weight α_i controls the contribution of the Log-Euclidean Similarity Distillation Loss L_Ti of teacher H_Ti and is learned by minimizing the validation empirical risk loss L_VER. The student model is trained on unlabelled data with the Adaptive Aggregated Distillation Loss L_TA.
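A minimal sketch of the Log-Euclidean Similarity Distillation Loss of Eq. (6), again assuming PyTorch; it reuses similarity_matrix() from the sketch in Section 3.1 and implements the matrix logarithm of Eq. (5) through an eigendecomposition, which is differentiable, so gradients reach the student features.

```python
import torch

def matrix_log(A: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """log(A) = U diag(log(lambda_1), ..., log(lambda_N)) U^T for an SPD matrix A (Eq. (5))."""
    eigvals, eigvecs = torch.linalg.eigh(A)   # symmetric eigendecomposition
    eigvals = eigvals.clamp(min=eps)          # numerical safeguard for near-zero eigenvalues
    return eigvecs @ torch.diag(eigvals.log()) @ eigvecs.t()

def log_euclidean_distillation_loss(feats_S: torch.Tensor,
                                    A_T: torch.Tensor) -> torch.Tensor:
    """L_T = || log(A_S) - log(A_T) ||_F^2 with the teacher matrix A_T fixed (Eq. (6))."""
    A_S = similarity_matrix(feats_S)          # student similarity matrix, Eq. (3)
    return (matrix_log(A_S) - matrix_log(A_T.detach())).pow(2).sum()
```

Dropping the two matrix_log calls gives the Euclidean variant ("L_T w/o log") used as a baseline in Section 5.3.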

4. Learning to learn from Multiple Teachers

We have illustrated learning from a single teacher model by similarity knowledge distillation. In this section, we study learning from multiple teacher models. Since not all teachers can provide effective and complementary knowledge, due to variations between cameras (e.g., illumination and background), we need to learn to adjust the contributions of multiple teachers, i.e., learning to learn. We propose the Adaptive Knowledge Aggregator to aggregate effective knowledge from multiple teachers for learning a student model, requiring only a few labelled identities to keep the labelling cost low. It is integrated with similarity knowledge distillation to form the Multi-teacher Adaptive Similarity Distillation Framework shown in Fig. 2.

4.1. Multi-teacher Adaptive Aggregated Distillation

To learn a student model H_S from multiple teacher models in a teacher model pool {H_Ti} (i = 1, ..., M) simultaneously, the Log-Euclidean Similarity Distillation Loss L_T in Eq. (6) for a single teacher H_T is generalized as follows:

\min_{\Theta_S} \; \mathcal{L}_{TA}(\mathbf{X}_S; \{\alpha_i\}_{i=1}^{M}) = \sum_{i=1}^{M} \alpha_i \mathcal{L}_{T_i}(\mathbf{X}_S), \qquad (7)

where L_Ti is the Log-Euclidean Similarity Distillation Loss for teacher model H_Ti, and α_i is the teacher weight controlling the contribution of L_Ti. The teacher weights should satisfy Σ_{i=1}^{M} α_i = 1 and α_i ≥ 0. X_S is the feature matrix extracted by the student model H_S parameterized by Θ_S.

The constraint Σ_{i=1}^{M} α_i = 1 normalizes the scale of α_i. It can be satisfied simply by setting α_i = |α̂_i| / Σ_{j=1}^{M} |α̂_j|, where α̂_i is an unconstrained real parameter. For unsupervised learning without prior knowledge, α_i can be set equally to 1/M. However, in practice there may be ineffective teacher models that provide wrong knowledge. Hence, instead of regarding α_i as a fixed hyperparameter that needs tuning, we aim to learn α_i dynamically to make the loss L_TA adaptive to the target domain. We call L_TA the Adaptive Aggregated Distillation Loss.
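A minimal sketch of the Adaptive Aggregated Distillation Loss of Eq. (7), assuming the log_euclidean_distillation_loss() sketch above; alpha_hat plays the role of the unconstrained parameters α̂_i that are normalised onto the simplex.

```python
import torch

def aggregated_distillation_loss(feats_S, teacher_sim_mats, alpha_hat):
    """L_TA = sum_i alpha_i * L_Ti, with alpha_i = |alpha_hat_i| / sum_j |alpha_hat_j| (Eq. (7))."""
    alpha = alpha_hat.abs() / alpha_hat.abs().sum()
    losses = torch.stack([log_euclidean_distillation_loss(feats_S, A_T)
                          for A_T in teacher_sim_mats])
    return (alpha * losses).sum(), alpha

# In the purely unsupervised setting, alpha_hat can simply be kept fixed at
# equal values, which makes every teacher weight alpha_i equal to 1/M.
```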

4.2. Adaptive Knowledge Aggregation

To learn the teacher weights {α_i} (i = 1, ..., M), guiding information is required. Generally, to validate whether a Re-ID system is working normally, it is necessary and feasible for a human operator to label a small number of identities (e.g., ≤ 10). Although this small amount of data is far from enough for training an effective model from scratch, due to overfitting, we can compute the validation empirical risk on it to provide guiding information for aggregating knowledge.

Validation Empirical Risk. For the target domain, we have a large amount of unlabelled data D_U = {I_i^U} (i = 1, ..., N). Additionally, we label data of a few identities (not in D_U) to form a small validation set D_L = {(I_i^L, y_i)} (i = 1, ..., N_v), where y_i ∈ {1, 2, ..., C_v} is the label (C_v = 10 in our case). To indicate whether the student is learning correct knowledge from the teachers, the empirical risk on the validation data D_L can be computed. Let x_{S,k}^U and x_{S,i}^L denote the features of an unlabelled sample I_k^U and a labelled sample I_i^L, respectively. Let P_L = {(i, j) | y_i = y_j} denote the set of index pairs (i, j) of positive sample pairs (I_i^L, I_j^L) in the validation set D_L. As there is no overlapping identity between D_L and D_U, (I_i^L, I_k^U) is a negative sample pair. We apply a softmax cross-entropy loss for computing the validation empirical risk by

\mathcal{L}_{VER}(\mathbf{X}_S^U, \mathbf{X}_S^L) = \sum_{(i,j) \in P_L} -\log \frac{\exp(\mathbf{x}_{S,i}^{L\top} \mathbf{x}_{S,j}^{L})}{\exp(\mathbf{x}_{S,i}^{L\top} \mathbf{x}_{S,j}^{L}) + \sum_{k=1}^{N} \exp(\mathbf{x}_{S,i}^{L\top} \mathbf{x}_{S,k}^{U})}. \qquad (8)

We call L_VER the validation empirical risk loss. When the similarity of a positive pair x_{S,i}^{L⊤} x_{S,j}^{L} becomes larger and the similarity of a negative pair x_{S,i}^{L⊤} x_{S,k}^{U} becomes smaller, the loss L_VER becomes smaller. Thus, it effectively indicates the empirical risk.
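A minimal sketch of the validation empirical risk loss of Eq. (8), under the same PyTorch assumption; x_L holds the (N_v, d) labelled validation features with their identity labels, and x_U holds the (N, d) unlabelled features that act as negatives.

```python
import torch

def validation_empirical_risk(x_L, labels, x_U):
    """L_VER of Eq. (8): softmax cross entropy over labelled positive pairs."""
    neg = x_L @ x_U.t()                              # (N_v, N) negative-pair similarities
    loss = x_L.new_zeros(())
    for i in range(x_L.size(0)):
        for j in range(x_L.size(0)):
            if i != j and labels[i] == labels[j]:    # positive pair (i, j) in P_L
                pos = x_L[i] @ x_L[j]
                loss = loss - (pos.exp() / (pos.exp() + neg[i].exp().sum())).log()
    return loss
```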

Adaptive Knowledge Aggregator. To learn teacher weights {α_i} (i = 1, ..., M) that minimize the validation empirical risk L_VER, we propose the Adaptive Knowledge Aggregator, which optimizes {α_i} by gradient descent.

In the learning process of the student model H_S, at each step we have the feature matrices X_S^U and X_S^L for unlabelled and labelled data, respectively. Feature learning is guided by gradient descent on the Adaptive Aggregated Distillation Loss L_TA. To evaluate whether the current teacher weights {α_i} (i = 1, ..., M) are effective, it is expected that the features updated by L_TA parameterized by {α_i} can decrease the validation empirical risk loss L_VER to the greatest extent. We simulate one update step of the features X_S^U and X_S^L by using the gradients of L_TA(X_S^U; {α_i}) and L_TA(X_S^L; {α_i}) with respect to X_S^U and X_S^L:

\mathbf{X}_S^{U\prime} = \mathbf{X}_S^{U} - \beta \frac{\partial \mathcal{L}_{TA}(\mathbf{X}_S^{U}; \{\alpha_i\}_{i=1}^{M})}{\partial \mathbf{X}_S^{U}}, \qquad \mathbf{X}_S^{L\prime} = \mathbf{X}_S^{L} - \beta \frac{\partial \mathcal{L}_{TA}(\mathbf{X}_S^{L}; \{\alpha_i\}_{i=1}^{M})}{\partial \mathbf{X}_S^{L}}, \qquad (9)

where X_S^{U′} and X_S^{L′} are the simulated updated features and β is the step size of the simulated update.

Then, we compute the validation empirical risk L_VER(X_S^{U′}, X_S^{L′}) of the updated features X_S^{U′} and X_S^{L′}. Note that the updated features X_S^{U′} and X_S^{L′} are related to the teacher weights α_i, because the gradients ∂L_TA(X_S^U; {α_i}) / ∂X_S^U and ∂L_TA(X_S^L; {α_i}) / ∂X_S^L contain α_i. Thus, to minimize the validation empirical risk loss L_VER, the gradient ∂L_VER(X_S^{U′}, X_S^{L′}) / ∂α_i can be computed and the teacher weights α_i can be learned by gradient descent as follows:

\alpha_i' = \alpha_i - \gamma_\alpha \frac{\partial \mathcal{L}_{VER}(\mathbf{X}_S^{U\prime}, \mathbf{X}_S^{L\prime})}{\partial \alpha_i}, \qquad (10)

where α_i′ is the updated value of the teacher weight α_i and γ_α is the learning rate.

With the objective of minimizing the validation empirical risk during training, the learning target L_TA(X_S; {α_i}) parameterized by the teacher weights {α_i} can adaptively select effective teacher models by weighting, providing better guidance for the student model.

Optimization. Before training, the teacher weights α_1, α_2, ..., α_M are initialized to 1/M. There are three main steps in training: (1) feature extraction and similarity matrix construction, as illustrated in Section 3.1; (2) updating the teacher weights {α_i} by the Adaptive Knowledge Aggregator; (3) updating the student model H_S by the distillation loss L_TA with the updated teacher weights {α_i}. The three steps are repeated to train the student model H_S. This process is also shown in Algorithm 1 in the supplementary material.
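For step (2), the following minimal sketch shows one Adaptive Knowledge Aggregation update (Eqs. (9)-(10)), assuming PyTorch and the loss sketches above. X_U and X_L are the student features of Eq. (2) for unlabelled data and the labelled validation identities, alpha_hat are the unconstrained teacher-weight parameters, and beta and gamma_alpha are the simulated step size and the weight learning rate. It also assumes that double backward is supported through the eigendecomposition in matrix_log(); otherwise, the Euclidean variant of the loss can be substituted in this step.

```python
import torch

def aggregation_step(X_U, X_L, labels_L, sims_T_U, sims_T_L,
                     alpha_hat, beta=1.0, gamma_alpha=0.1):
    # X_U, X_L must be part of an autograd graph (e.g., leaves with requires_grad=True).
    loss_U, _ = aggregated_distillation_loss(X_U, sims_T_U, alpha_hat)
    loss_L, _ = aggregated_distillation_loss(X_L, sims_T_L, alpha_hat)

    # Eq. (9): simulate one gradient step on the features; create_graph keeps the
    # simulated features differentiable with respect to alpha_hat.
    grad_U, = torch.autograd.grad(loss_U, X_U, create_graph=True)
    grad_L, = torch.autograd.grad(loss_L, X_L, create_graph=True)
    X_U_sim, X_L_sim = X_U - beta * grad_U, X_L - beta * grad_L

    # Eq. (10): move the teacher weights so that the simulated update would
    # lower the validation empirical risk.
    risk = validation_empirical_risk(X_L_sim, labels_L, X_U_sim)
    grad_alpha, = torch.autograd.grad(risk, alpha_hat)
    return (alpha_hat - gamma_alpha * grad_alpha).detach().requires_grad_()
```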


Figure 3. Change of the teacher weights α_i (upper plot) and change of the validation empirical risk loss L_VER with and without the Adaptive Knowledge Aggregator (lower plot) during training on Market-1501. The teacher weights select the more effective teacher models trained on larger datasets (MSMT17, DukeMTMC). The validation empirical risk loss L_VER can be further minimized with the learned teacher weights.

Visual Understanding. To better understand the effect of the adaptive teacher weights α_i, the change of α_i and the change of the validation empirical risk loss L_VER during training on Market-1501 are shown in Fig. 3. The experimental details are given in Section 5. It can be observed that, with the Adaptive Knowledge Aggregator, the learned teacher weights are larger for effective teachers (trained on the large datasets MSMT17 and DukeMTMC) and smaller for the ineffective teacher model (trained on the small dataset ViPER). Compared with the case without the aggregator (i.e., equal teacher weights), the adaptive teacher weights can further minimize the validation empirical risk loss L_VER.

5. Experiments

We conducted extensive experiments on two large person re-identification benchmark datasets, Market-1501 [66] and DukeMTMC [70]. Our Multi-teacher Adaptive Similarity Distillation Framework was evaluated and compared to knowledge distillation, unsupervised, semi-supervised and small sample learning methods. Further evaluations show the effectiveness of the Log-Euclidean Similarity Distillation Loss and the Adaptive Knowledge Aggregator.

Experiment Settings and Datasets. We used Market-1501 [66] and DukeMTMC [70] as target datasets. Source datasets were used for training teacher models. To diversify the effectiveness of the teacher models for the target scene and simulate the practical situation, datasets of different scales collected in different scenes were used. We trained five teacher models T1, T2, T3, T4 and T5 with labelled data from the training sets of MSMT17 [53], CUHK03 [28], ViPER [18], DukeMTMC [70] and Market-1501 [66], respectively.


Table 1. Basic information of datasets.

Teacher | Dataset | Images | Identities | Cameras | Scene
T1 | MSMT17 [53] | 126,441 | 4,101 | 15 | outdoor, indoor
T2 | CUHK03 [28] | 28,192 | 1,467 | 2 | indoor
T3 | VIPeR [18] | 1,264 | 632 | 2 | outdoor
T4 | DukeMTMC [70] | 36,411 | 1,812 | 8 | outdoor
T5 | Market-1501 [66] | 32,668 | 1,501 | 6 | outdoor

Once a teacher model was trained, the source data and extra training were no longer needed. Basic information about the datasets is given in Table 1. Note that, when forming the teacher model pool, the teacher model of the target dataset was excluded. For Market-1501, the teacher model pool with M = 4 teachers was {T1, T2, T3, T4} (MSMT17, CUHK03, ViPER, DukeMTMC); for DukeMTMC, the teacher model pool with M = 4 teachers was {T1, T2, T3, T5} (MSMT17, CUHK03, ViPER, Market-1501).

The standard splits of training and testing IDs of Market-1501 and DukeMTMC were adopted, as in [66] and [70]. Our experiments were conducted under two settings: (1) Unsupervised setting: all data in the training set is unlabelled. (2) Semi-supervised setting: concerning labelling cost, C_v = 10 identities in the training set were randomly selected to be labelled, and the remaining data was unlabelled.

Performance Metrics. In testing, similarities between query and gallery samples were determined by the student model. The performance metrics cumulative matching characteristic (CMC) and mean average precision (mAP) were applied, following the standard evaluation protocol in [66] and [70]. Note that scalability is also important in our evaluations, including training time, model size indicated by the number of parameters (#Para) and the testing computation cost of the model indicated by floating-point operations (FLOPs).

Implementation Details. For the teacher models of the source scenes, an advanced Re-ID model, PCB [47], was adopted. For the student model of the target scene, a lightweight model, MobileNetV2 [46], was adopted, and a convolution layer was applied to reduce the channel number of the last feature map to 256. It was initialized by ImageNet pretraining, without training on any Re-ID dataset. The input images were resized to 384 × 128 and the feature maps of the last convolution layer were extracted as feature vectors. In each batch, for computing the validation empirical risk, we sampled two images for each identity from the labelled data to guarantee positive pairs. More details are provided in the supplementary material.
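As an illustration of the student described above, here is a minimal sketch assuming torchvision's MobileNetV2 with ImageNet weights. The 1×1 convolution that reduces the last feature map to 256 channels and the global pooling are written as we understand the description; the exact layer choices are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class StudentNet(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Older torchvision versions use models.mobilenet_v2(pretrained=True) instead.
        backbone = models.mobilenet_v2(weights="IMAGENET1K_V1")
        self.features = backbone.features                       # convolutional trunk
        self.reduce = nn.Conv2d(1280, feat_dim, kernel_size=1)  # channel reduction to 256
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                      # x: (B, 3, 384, 128) resized person crops
        f = self.reduce(self.features(x))
        return self.pool(f).flatten(1)         # (B, feat_dim) feature vectors

student = StudentNet()
feats = student(torch.randn(2, 3, 384, 128))   # example forward pass
```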

5.1. Comparison under Unsupervised Setting

Compared Methods. For the unsupervised setting, we did not use labelled data for our method and set fixed equal teacher weights α_i = 1/M in the Adaptive Aggregated Distillation Loss L_TA in Eq. (7), without using the Adaptive Knowledge Aggregator. We compared with unsupervised Re-ID methods, including the unsupervised features LOMO [31] and BOW [66] and the unsupervised learning models UMDL [42], PTGAN [53], PUL [14], CAMEL [62], SPGAN [13], TJ-AIDL [51] and HHL [71].

Table 2. Performance under the unsupervised setting. "Ours (unsupervised)" denotes the unsupervised version of our method. "Backbones" denotes model architecture. "#Para" denotes the number of parameters. "FLOPs" denotes floating-point operations (testing computation cost). "Train" denotes training time. "R-1" denotes rank-1 accuracy (%). "mAP" denotes mean average precision (%).

Methods | Backbones | #Para (M) | FLOPs (G) | Market-1501 R-1 / mAP / Train | DukeMTMC R-1 / mAP / Train
LOMO [31] | - | - | - | 27.2 / 8.0 / - | 12.3 / 4.8 / -
BOW [66] | - | - | - | 35.8 / 14.8 / - | 17.1 / 8.3 / -
UMDL [42] | - | - | - | 34.5 / 12.4 / - | 18.5 / 7.3 / -
PTGAN [53] | GoogleNet | 6.8 | 1.5 | 38.6 / - / - | 27.4 / - / -
PUL [14] | ResNet-50 | 25.6 | 4.1 | 45.5 / 20.5 / - | 30.0 / 16.4 / -
CAMEL [62] | ResNet-56 | 0.9 | 6.2 | 54.5 / 26.3 / 13.7h | 42.2 / 21.0 / 13.7h
SPGAN [13] | ResNet-50 | 25.6 | 4.1 | 57.7 / 26.7 / - | 46.4 / 26.2 / -
TJ-AIDL [51] | MobileNet | 4.2 | 0.6 | 58.2 / 26.5 / - | 44.3 / 23.0 / -
HHL [71] | ResNet-50 | 25.6 | 4.1 | 62.2 / 31.4 / 21.0h | 46.9 / 27.2 / 21.0h
Fukuda [16] | MobileNetV2 | 3.4 | 0.3 | 45.1 / 23.0 / 1.1h | 27.6 / 18.9 / 1.4h
Ours (unsupervised) | MobileNetV2 | 3.4 | 0.3 | 61.5 / 33.5 / 1.1h | 48.4 / 29.4 / 1.4h

Among them, the advanced deep models require source data for transfer learning or pretraining, while our method requires only teacher models. Moreover, we compared with a multi-teacher knowledge distillation method, Fukuda et al. [16]. For evaluating scalability, training time was measured on a TITAN X GPU. In practice, since the model for the target domain is a new user-specified model that has not learned from Re-ID data, the time of pretraining on source data was included in the training time. The results, as well as the number of parameters (#Para) and testing computation costs (FLOPs), are reported in Table 2.

Results and Analysis. Our method outperformed the compared unsupervised Re-ID and multi-teacher distillation methods, except that its rank-1 accuracy on Market-1501 is slightly lower than that of HHL [71]. Although our method does not require source data, the Log-Euclidean Similarity Distillation Loss can effectively transfer knowledge of the source domains to the student model. The complementarity of the knowledge of multiple teacher models increases the generalization ability of the student model. The distillation method of Fukuda et al. [16] also exploited multiple teachers, but it is based on soft labels for closed-set classification, which is not effective for the open-set Re-ID problem.

The scalability of the methods is compared as follows. The training data required by our method contains only target data and is smaller than that of the other methods, which require source data for pretraining or joint training. The computation cost, indicated by FLOPs, of our backbone model MobileNetV2 is much lower than that of the others. With a smaller training set and a lighter backbone model, the training time is much shorter than that of methods with comparable performance such as CAMEL [62] and HHL [71]. Thus, our framework is more scalable than the compared methods.

5.2. Comparison under Semi-supervised Setting

Compared Methods. For the semi-supervised setting, 10 labelled IDs were available, so the full version of our method ("Ours (semi)") can be applied. For comparison, we chose two competitive recent unsupervised Re-ID methods, CAMEL [62] and HHL [71], and extended them to semi-supervised versions ("CAMEL (semi)" and "HHL (semi)") by using the positive and negative sample pairs obtained from the labelled samples.


Table 3. Performance (%) under the semi-supervised setting with 10 labelled identities. Semi-supervised and small sample learning methods for Re-ID were compared (same notations as Table 2).

Methods | Backbones | #Para (M) | FLOPs (G) | Market-1501 R-1 / mAP / Train | DukeMTMC R-1 / mAP / Train
DNS [63] | ResNet-50 | 25.6 | 4.1 | 11.8 / 5.3 / 3.2h | 10.0 / 4.6 / 3.2h
CAMEL (semi) | ResNet-56 | 0.9 | 6.2 | 54.4 / 26.2 / 13.7h | 42.1 / 21.1 / 13.7h
HHL (semi) | ResNet-50 | 25.6 | 4.1 | 63.9 / 34.4 / 21.0h | 46.7 / 26.5 / 21.0h
CAMEL (semi) | MobileNetV2 | 3.4 | 0.3 | 13.2 / 5.0 / 1.4h | 13.6 / 5.6 / 1.4h
HHL (semi) | MobileNetV2 | 3.4 | 0.3 | 56.7 / 27.7 / 21.0h | 42.9 / 23.9 / 21.0h
Ours (semi) | MobileNetV2 | 3.4 | 0.3 | 63.7 / 35.4 / 1.3h | 57.4 / 36.7 / 1.7h

We also tested using MobileNetV2 trained on MSMT17 (the source dataset of the best teacher model) for CAMEL (semi), and using MobileNetV2 as the backbone for HHL (semi). A small sample learning Re-ID method, DNS [63], was also compared, for which we used a PCB [47] model trained on MSMT17 as the best teacher model T1. Training time was measured as in the unsupervised setting in Section 5.1. The results are reported in Table 3.

Results and Analysis. Among the compared methods, the performance of our method is the best, except that its rank-1 accuracy is slightly lower than that of HHL (semi) on Market-1501. Compared with the unsupervised results in Table 2, CAMEL (semi) and HHL (semi) benefited little from the 10 extra labelled identities. The small sample learning method DNS failed due to overfitting. Compared with our unsupervised version in Table 2, our method benefited more from the labelled identities, especially on DukeMTMC (explained in Section 5.3). Although 10 labelled identities provide little information for directly learning a model, they are sufficient for our method to compute the validation empirical risk and select better teacher models. Moreover, our method can learn MobileNetV2 more effectively than HHL and CAMEL. The scalability analysis is similar to Section 5.1.

5.3. Further Evaluations

In this section, we further evaluate and analyze the components and capabilities of our method.

Evaluation of Knowledge Distillation. Knowledge distillation is the key technique in our Multi-teacher Adaptive Similarity Distillation Framework. To fairly compare our Log-Euclidean Similarity Distillation Loss L_T in Eq. (6) with existing knowledge distillation losses, we evaluated learning from a single teacher model T1 (MSMT17). To evaluate the effectiveness of the log-Euclidean metric in L_T, we also tested the Euclidean metric, denoted by "L_T w/o log". We compared with a representative soft-label-based distillation method, Hinton et al. [21], and a recent probability-distribution-based distillation method, PKT [40]. The results are reported in Table 4.

The performance of our loss L_T is the best and is very close to that of the teacher model, which is the upper bound for distillation methods. The performance of Hinton et al. [21] is lower than that of the other methods, because it is designed for closed-set classification.

Table 4. Performance (%) of distilling a single teacher model T1. Our loss L_T without logarithm ("L_T w/o log") and other knowledge distillation methods were compared. Please see the text for details.

Methods | Market-1501 R-1 / mAP | DukeMTMC R-1 / mAP
Teacher T1 (MSMT17) | 51.5 / 24.9 | 47.6 / 30.6
Hinton et al. [21] | 41.2 / 20.0 | 32.5 / 20.8
PKT [40] | 46.1 / 22.3 | 44.7 / 28.6
L_T w/o log | 47.7 / 23.0 | 45.0 / 29.8
L_T (Eq. (6)) | 49.7 / 24.6 | 47.6 / 31.1

It relies on soft labels, which are not suitable for the open-set Re-ID problem. PKT [40] used the probability distribution for knowledge distillation, which is not as effective for Re-ID as using similarity, as in our method. The comparison between "L_T" and "L_T w/o log" shows the effectiveness of the log-Euclidean metric, which takes into account the symmetric positive definite (SPD) property of the similarity matrices.

Effect of the Learned Teacher Weights α_i. Another key component of our method is the Adaptive Knowledge Aggregator, which learns teacher weights α_i to adjust the distillation loss L_TA in Eq. (7). We conducted the following experiments: (1) evaluating the performance of all teachers individually; (2) ensembles of all teachers by distance fusion, with weights learned by RankSVM [43] on the validation set and with equal weights, and joint training of a PCB model [47] with all source data; (3) using the multi-task weighting methods Uncertainty [23] and GradNorm [10] to learn the teacher weights in our framework; (4) training our framework without learning the teacher weights α_i ("Ours (unsupervised)") and training our framework with subsets of teachers with the top-k α_i. The backbone of our method was MobileNetV2 and that of the others was ResNet-50. The results and the teacher weights α_i after training are reported in Table 5.

- Individual Teacher. For Market-1501, teachers T1 (MSMT17), T2 (CUHK03) and T4 (DukeMTMC) are effective and comparable. For DukeMTMC, only teacher T1 (MSMT17) is effective, since DukeMTMC is more challenging, with more camera views than Market-1501. For both datasets, T3 (ViPER) is the worst because its training set is small; T3 provides weak knowledge as interference to test the robustness of the system. Our method outperformed the best teacher by about 10% and is more lightweight.

- w/ and w/o Learning α_i. The teacher weights α_i learned by the Adaptive Knowledge Aggregator can indicate the effectiveness of the teachers. The weights for the worst teacher T3 are nearly zero. Comparing "Ours (semi)" with "Ours (unsupervised)": for Market-1501, the selection by teacher weights brings limited improvement, whereas for DukeMTMC the improvement is much more significant, because distance fusion with equal weights is already better than the individual teachers for Market-1501 but is not effective for DukeMTMC. Ensembling by distance fusion increases the testing computation cost and is not as scalable as our method.

Furthermore, we ranked α_i in descending order to select a subset of teachers for training.


Table 5. Performance (%) of evaluating the Adaptive Knowledge Aggregator for selecting teachers. "α_i" is the teacher weight learned by the Adaptive Knowledge Aggregator. Ensembles of teachers, joint training and task weighting were compared.

Market-1501 (teacher model pool {T1, T2, T3, T4})
Models | #Para (M) | FLOPs (G) | α_i | R-1 | mAP | Train
Teacher T1 (MSMT17) | 25.6 | 4.1 | 0.398 | 51.5 | 24.9 | 3.4h
Teacher T2 (CUHK03) | 25.6 | 4.1 | 0.145 | 51.7 | 26.2 | 1.4h
Teacher T3 (ViPER) | 25.6 | 4.1 | 0.000 | 28.5 | 12.0 | 0.1h
Teacher T4 (DukeMTMC) | 25.6 | 4.1 | 0.458 | 49.4 | 24.2 | 1.7h
RankSVM weighted fusion (all teachers) | 102.4 | 16.4 | - | 57.6 | 31.6 | -
Distance fusion (all teachers) | 102.4 | 16.4 | - | 56.5 | 30.7 | -
Joint training (all source data) | 25.6 | 4.1 | - | 62.8 | 35.6 | 6.6h
Uncertainty [23] | 3.4 | 0.3 | - | 61.3 | 33.3 | 1.1h
GradNorm [10] | 3.4 | 0.3 | - | 60.5 | 32.8 | 1.1h
Ours (unsupervised) | 3.4 | 0.3 | - | 61.5 | 33.5 | 1.1h
Ours (Ti of top-1 α_i, {T4}) | 3.4 | 0.3 | - | 48.1 | 23.8 | 1.1h
Ours (Ti of top-2 α_i, {T4, T1}) | 3.4 | 0.3 | - | 63.0 | 34.6 | 1.3h
Ours (Ti of top-3 α_i, {T4, T1, T2}) | 3.4 | 0.3 | - | 63.5 | 35.5 | 1.3h
Ours (semi) | 3.4 | 0.3 | - | 63.7 | 35.4 | 1.3h

DukeMTMC (teacher model pool {T1, T2, T3, T5})
Models | #Para (M) | FLOPs (G) | α_i | R-1 | mAP | Train
Teacher T1 (MSMT17) | 25.6 | 4.1 | 0.581 | 47.6 | 30.6 | 3.4h
Teacher T2 (CUHK03) | 25.6 | 4.1 | 0.071 | 25.3 | 14.8 | 1.4h
Teacher T3 (ViPER) | 25.6 | 4.1 | 0.029 | 19.7 | 10.7 | 0.1h
Teacher T5 (Market-1501) | 25.6 | 4.1 | 0.320 | 30.8 | 18.6 | 1.3h
RankSVM weighted fusion (all teachers) | 102.4 | 16.4 | - | 41.8 | 28.3 | -
Distance fusion (all teachers) | 102.4 | 16.4 | - | 39.7 | 26.8 | -
Joint training (all source data) | 25.6 | 4.1 | - | 53.6 | 36.1 | 6.2h
Uncertainty [23] | 3.4 | 0.3 | - | 51.0 | 30.6 | 1.4h
GradNorm [10] | 3.4 | 0.3 | - | 44.8 | 26.5 | 1.4h
Ours (unsupervised) | 3.4 | 0.3 | - | 48.4 | 29.4 | 1.4h
Ours (Ti of top-1 α_i, {T1}) | 3.4 | 0.3 | - | 47.6 | 31.1 | 1.4h
Ours (Ti of top-2 α_i, {T1, T5}) | 3.4 | 0.3 | - | 57.9 | 36.7 | 1.7h
Ours (Ti of top-3 α_i, {T1, T5, T2}) | 3.4 | 0.3 | - | 57.5 | 36.7 | 1.7h
Ours (semi) | 3.4 | 0.3 | - | 57.4 | 36.7 | 1.7h

The teachers with large α_i (> 1/4) bring significant improvement, while those with small α_i (< 1/4) do not, because they are weak or cannot provide complementary knowledge.

- Comparison with Ensemble and Task Weighting. Our method outperformed the ensembles, joint training and task weighting [23, 10]. Moreover, our testing computation cost is much lower than that of the ensembles and our training time is shorter than that of joint training. These results show the advantages of our similarity knowledge distillation and of the Adaptive Knowledge Aggregator for aggregating the knowledge of the teachers.

The Number of Validation IDs. We tested using different numbers of validation identities, from 0 to 50. As shown in Table 6, our method achieves comparable performance with 5 to 50 identities. Since the validation identities are only used for learning the teacher weights α_i and not for training the student model parameters, there is no overfitting problem even with only 1 labelled identity. Comparing the use of 1 ID with 0 IDs, the performance dropped significantly without the labelled identity, especially on DukeMTMC, which indicates the importance of the validation empirical risk. Thus, our Adaptive Knowledge Aggregator is robust with only a few labelled identities.

Different Student Model Architectures. To show the flexibility of our method, we used MobileNetV2 [46], ResNet-18 and ResNet-50 [20] as backbones for the student model, which have different architectures and capacities. The results in Table 7 show that our method achieved comparable performance for all three models. Thus, knowledge can be effectively distilled to models of different architectures.

Finetuning with Our Method as Initialization. When more labelled data is given, the MobileNetV2 [46] student model learned by our method can be used as initialization for finetuning.

Table 6. Performance (%) of using different numbers of labelled identities in the validation set.

Validation set IDs | 0 | 1 | 5 | 10 | 20 | 50
Market-1501 R-1 | 61.5 | 62.6 | 63.2 | 63.7 | 63.9 | 64.2
Market-1501 mAP | 33.5 | 34.2 | 34.6 | 35.4 | 35.4 | 35.8
DukeMTMC R-1 | 48.4 | 55.9 | 57.6 | 57.4 | 57.5 | 57.9
DukeMTMC mAP | 29.4 | 35.0 | 36.6 | 36.7 | 36.8 | 37.1

Table 7. Performance (%) of different backbones of the student model.

Backbones | Market-1501 R-1 / mAP | DukeMTMC R-1 / mAP
ResNet-50 | 63.5 / 35.2 | 56.1 / 35.4
ResNet-18 | 63.1 / 34.9 | 56.7 / 36.4
MobileNetV2 | 63.7 / 35.4 | 57.4 / 36.7

Table 8. Performance (%) of finetuning MobileNetV2 [46] with our method as initialization on a small subset of 20% of the IDs.

Initialization | Labelled IDs | Market-1501 R-1 / mAP | DukeMTMC R-1 / mAP
ImageNet | 100% | 78.3 / 55.5 | 64.7 / 45.4
ImageNet | 20% | 60.1 / 35.6 | 49.9 / 29.7
Ours | 20% | 77.1 / 57.7 | 66.7 / 46.0

We finetuned it on 20% of the labelled identities and compared with models initialized by ImageNet pretraining and finetuned on 20% and 100% of the labelled identities. The results are shown in Table 8. The model initialized by our method and finetuned on 20% of the IDs achieves performance comparable to the model initialized by ImageNet pretraining and finetuned on 100% of the IDs, while the performance of finetuning on 20% of the IDs with ImageNet pretraining is much lower. With prior knowledge of Re-ID, our method generalizes better with fewer labelled target samples.

6. Conclusion

In this paper, we aim to address the scalability of person re-identification from three aspects: labelling cost, extension cost and testing computation cost. We propose a Multi-teacher Adaptive Similarity Distillation Framework, which can flexibly train a new user-specified lightweight model with only a few labelled identities and without source data. The framework stores the knowledge of Re-ID in a teacher model pool. When extending to a new scene, knowledge can be adaptively aggregated and distilled to a lightweight student model. For knowledge distillation for Re-ID, an open-set identification problem, we propose the Log-Euclidean Similarity Distillation Loss to imitate the sample pairwise similarity matrix of the teacher model. To learn effectively from multiple teachers, we propose the Adaptive Knowledge Aggregator, which adjusts the contribution of each teacher model by minimizing the validation empirical risk computed on a few labelled identities. Extensive evaluations show that our method is more scalable and achieves performance comparable to state-of-the-art unsupervised and semi-supervised Re-ID methods.

Acknowledgement

This work was supported partially by the National Key Research and Development Program of China (2016YFB1001002), NSFC (61522115, 61573387, U1611461), Guangdong Province Science and Technology Innovation Leading Talents (2016TX03X157), and the Royal Society Newton Advanced Fellowship (NA150459).


References

[1] Ejaz Ahmed, Michael Jones, and Tim K. Marks. An improved deep learning architecture for person re-identification. In CVPR, 2015. 1, 2

[2] TM Feroz Ali and Subhasis Chaudhuri. Maximum marginmetric learning over discriminative nullspace for person re-identification. In CVPR, 2018. 2

[3] Shuang Ao, Xiang Li, and Charles X. Ling. Fast generalized distillation for semi-supervised domain adaptation. In AAAI, 2017. 2

[4] Vincent Arsigny, Pierre Fillard, Xavier Pennec, and NicholasAyache. Fast and simple calculus on tensors in the log-euclidean framework. In MICCAI, 2005. 3

[5] Song Bai, Xiang Bai, and Qi Tian. Scalable person re-identification on supervised smoothed manifold. In CVPR,2017. 2

[6] Slawomir Bak and Peter Carr. One-shot metric learning forperson re-identification. In CVPR, 2017. 2

[7] Slawomir Bak, Peter Carr, and Jean-Francois Lalonde. Do-main adaptation through synthesis for unsupervised personre-identification. In ECCV, 2018. 2

[8] Jiaxin Chen, Yunhong Wang, Jie Qin, Li Liu, and Ling Shao.Fast person re-identification via cross-camera semantic bina-ry transformation. In CVPR, 2017. 2

[9] Ying-Cong Chen, Wei-Shi Zheng, Jian-Huang Lai, and PongYuen. An asymmetric distance model for cross-view featuremapping in person re-identification. IEEE TCSVT, 2015. 2

[10] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, 2018. 7, 8

[11] De Cheng, Yihong Gong, Zhihui Li, Dingwen Zhang, Wei-wei Shi, and Xingjun Zhang. Cross-scenario transfer metriclearning for person re-identification. PR, 2018. 1, 2

[12] Boris Chidlovskii, Stephane Clinchant, and Gabriela Csurka.Domain adaptation in the absence of source domain data. InSIGKDD, 2016. 2

[13] Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, YiYang, and Jianbin Jiao. Image-image domain adaptation withpreserved self-similarity and domain-dissimilarity for personreidentification. In CVPR, 2018. 2, 6

[14] Hehe Fan, Liang Zheng, Chenggang Yan, and Yi Yang.Unsupervised person re-identification: Clustering and fine-tuning. ACM Transactions on Multimedia Computing, Com-munications, and Applications (TOMM), 2018. 2, 6

[15] Michela Farenzena, Loris Bazzani, Alessandro Perina, Vitto-rio Murino, and Marco Cristani. Person re-identification bysymmetry-driven accumulation of local features. In CVPR,2010. 2

[16] Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. In Proc. Interspeech, 2017. 2, 3, 6

[17] Chen Gong, Xiaojun Chang, Meng Fang, and Jian Yang.Teaching semi-supervised classifier via generalized distilla-tion. In IJCAI, 2018. 2

[18] Douglas Gray, Shane Brennan, and Hai Tao. Evaluating ap-pearance models for recognition, reacquisition, and tracking.In PETS, 2007. 5, 6

[19] Douglas Gray and Hai Tao. Viewpoint invariant pedestrianrecognition with an ensemble of localized features. In ECCV.2008. 2

[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition. In CVPR,2016. 2, 8

[21] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling theknowledge in a neural network. In NeurIP workshop, 2015.2, 3, 7

[22] Lin Ji, Liangliang Ren, Jiwen Lu, Jianjiang Feng, andZhou Jie. Consistent-aware deep learning for person re-identification in a camera network. In CVPR, 2017. 2

[23] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-tasklearning using uncertainty to weigh losses for scene geome-try and semantics. In CVPR, 2018. 7, 8

[24] Elyor Kodirov, Tao Xiang, Zhenyong Fu, and ShaogangGong. Person re-identification by unsupervised `1 graphlearning. In ECCV, 2016. 1, 2

[25] Martin Kostinger, Martin Hirzer, Paul Wohlhart, Peter MRoth, and Horst Bischof. Large scale metric learning fromequivalence constraints. In CVPR, 2012. 1, 2

[26] Ilja Kuzborskij and Francesco Orabona. Stability and hy-pothesis transfer learning. In ICML, 2013. 2

[27] Minxian Li, Xiatian Zhu, and Shaogang Gong. Unsupervisedperson re-identification by deep learning tracklet association.In ECCV, 2018. 2

[28] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deep-reid: Deep filter pairing neural network for person re-identification. In CVPR, 2014. 1, 2, 5, 6

[29] Xiang Li, Wei-Shi Zheng, Xiaojuan Wang, Tao Xiang, andShaogang Gong. Multi-scale learning for low-resolution per-son re-identification. In ICCV, 2015. 2

[30] Zhen Li, Shiyu Chang, Feng Liang, Thomas S Huang, Lian-gliang Cao, and John R Smith. Learning locally-adaptivedecision functions for person verification. In CVPR, 2013. 2

[31] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z Li. Per-son re-identification by local maximal occurrence represen-tation and metric learning. In CVPR, 2015. 1, 2, 6

[32] Chunxiao Liu, Shaogang Gong, Chen Change Loy, and Xinggang Lin. Person re-identification: what features are important? In ECCV Workshop, 2012. 2

[33] Lianyang Ma, Xiaokang Yang, and Dacheng Tao. Personre-identification over camera networks using multi-task dis-tance metric learning. IEEE TIP, 2014. 2

[34] Niki Martinel, Abir Das, Christian Micheloni, and Amit KRoy-Chowdhury. Temporal model adaptation for person re-identification. In ECCV, 2016. 2

[35] T. Matsukawa, T. Okabe, E. Suzuki, and Y. Sato. Hierarchi-cal gaussian descriptor for person re-identification. In CVPR,2016. 2

[36] Alexis Mignon and Frederic Jurie. Pcca: A new approach fordistance learning from sparse pairwise constraints. In CVPR,2012. 2


[37] Asit Mishra and Debbie Marr. Apprentice: Using knowledgedistillation techniques to improve low-precision network ac-curacy. arXiv preprint arXiv:1711.05852, 2017. 2, 3

[38] Sakrapee Paisitkriangkrai, Chunhua Shen, and Antonvan den Hengel. Learning to rank in person re-identificationwith metric ensembles. In CVPR, 2015. 2

[39] Rameswar Panda, Amran Bhuiyan, Vittorio Murino, andAmit K Roy-Chowdhury. Unsupervised adaptive re-identification in open world dynamic camera networks. InCVPR, 2017. 2

[40] Nikolaos Passalis and Anastasios Tefas. Learning deep rep-resentations with probabilistic knowledge transfer. In ECCV,2018. 2, 7

[41] Sateesh Pedagadi, James Orwell, Sergio Velastin, andBoghos Boghossian. Local fisher discriminant analysis forpedestrian re-identification. In CVPR, 2013. 2

[42] Peixi Peng, Tao Xiang, Yaowei Wang, Massimiliano Pon-til, Shaogang Gong, Tiejun Huang, and Yonghong Tian.Unsupervised cross-dataset transfer learning for person re-identification. In CVPR, 2016. 1, 2, 6

[43] Bryan Prosser, Wei-Shi Zheng, Shaogang Gong, Tao Xiang,and Q Mary. Person re-identification by support vector rank-ing. In BMVC, 2010. 2, 7

[44] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou,Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets:Hints for thin deep nets. arXiv preprint arXiv:1412.6550,2014. 2

[45] Sourya Roy, Sujoy Paul, Neal E Young, and Amit K Roy-Chowdhury. Exploiting transitivity for learning person re-identification models on a budget. In CVPR, 2018. 2

[46] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh-moginov, and Liang-Chieh Chen. Mobilenetv2: Invertedresiduals and linear bottlenecks. In CVPR, 2018. 6, 8

[47] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and ShengjinWang. Beyond part models: Person retrieval with refinedpart pooling. In ECCV, 2018. 1, 2, 3, 6, 7

[48] Antti Tarvainen and Harri Valpola. Mean teachers are betterrole models: Weight-averaged consistency targets improvesemi-supervised deep learning results. In NeurIP, 2017. 2

[49] Rahul Rama Varior, Mrinal Haloi, and Gang Wang. Gatedsiamese convolutional neural network architecture for humanre-identification. In ECCV, 2016. 2

[50] Hanxiao Wang, Shaogang Gong, Xiatian Zhu, and Tao Xi-ang. Human-in-the-loop person re-identification. In ECCV,2016. 2

[51] Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li.Transferable joint attribute-identity deep learning for unsu-pervised person re-identification. In CVPR, 2018. 1, 2, 6

[52] Xiaojuan Wang, Wei-Shi Zheng, Xiang Li, and JianguoZhang. Cross-scenario transfer person reidentification. IEEETCSVT, 2016. 1, 2

[53] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian.Person transfer gan to bridge domain gap for person re-identification. In CVPR, 2018. 2, 5, 6

[54] Kilian Q Weinberger, John Blitzer, and Lawrence K Saul.Distance metric learning for large margin nearest neighborclassification. In NeurIP, 2005. 2, 3

[55] Shangxuan Wu, Ying-Cong Chen, Xiang Li, Ancong Wu,Jinjie You, and Wei-Shi Zheng. An enhanced deep featurerepresentation for person re-identification. In WACV, 2016.2

[56] Weiji Wu, Ancong Wu, and Wei-Shi Zheng. Light personre-identification by multi-cue tiny net. In ICIP, 2018. 1, 2

[57] Tong Xiao, Hongsheng Li, Wanli Ouyang, and XiaogangWang. Learning deep feature representations with domainguided dropout for person re-identification. In CVPR, 2016.2

[58] Fei Xiong, Mengran Gou, Octavia Camps, and Mario Sz-naier. Person re-identification using kernel-based metriclearning methods. In ECCV, 2014. 2

[59] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. Agift from knowledge distillation: Fast optimization, networkminimization and transfer learning. In CVPR, 2017. 2

[60] Jinjie You, Ancong Wu, Xiang Li, and Wei-Shi Zheng. Top-push video-based person re-identification. In CVPR, 2016.2

[61] Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learningfrom multiple teacher networks. In SIGKDD, 2017. 2

[62] Hong-Xing Yu, Ancong Wu, and Wei-Shi Zheng. Cross-view asymmetric metric learning for unsupervised person re-identification. In ICCV, 2017. 1, 2, 6

[63] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a dis-criminative null space for person re-identification. In CVPR,2016. 2, 7

[64] Haiyu Zhao, Maoqing Tian, Shuyang Sun, Shao Jing, JunjieYan, Yi Shuai, Xiaogang Wang, and Xiaoou Tang. Spindlenet: Person re-identification with human body region guidedfeature decomposition and fusion. In CVPR, 2017. 2

[65] Liming Zhao, Li Xi, Yueting Zhuang, and Jingdong Wang.Deeply-learned part-aligned representations for person re-identification. In ICCV, 2017. 2

[66] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jing-dong Wang, and Qi Tian. Scalable person re-identification:A benchmark. In ICCV, 2015. 5, 6

[67] Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. Reidentifi-cation by relative distance comparison. IEEE TPAMI, 2013.1, 2

[68] Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. Towardsopen-world person re-identification by one-shot group-basedverification. IEEE TPAMI, 2016. 2

[69] Wei-Shi Zheng, Xiang Li, Tao Xiang, Shengcai Liao,Jianhuang Lai, and Shaogang Gong. Partial person re-identification. In ICCV, 2015. 2

[70] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled sam-ples generated by gan improve the person re-identificationbaseline in vitro. In ICCV, 2017. 2, 5, 6

[71] Zhun Zhong, Liang Zheng, Shaozi Li, and Yi Yang. Gener-alizing a person retrieval model hetero-and homogeneously.In ECCV, 2018. 2, 6

[72] Zhun Zhong, Liang Zheng, Zhedong Zheng, Shaozi Li,and Yi Yang. Camera style adaptation for person re-identification. In CVPR, 2018. 2

[73] Xiatian Zhu, Botong Wu, Dongcheng Huang, and Wei-ShiZheng. Fast open-world person re-identification. IEEE TIP,2017. 1, 2