
Adversarially Erased Learning for Person Re-identification by Fully Convolutional Networks

Author

Email:

Abstract—Despite recent remarkable progress, person re-identification is still subject to failure cases, in particular when pedestrians are occluded, and more so when key parts of their body are missing. To address this issue, we propose a feature-level augmentation strategy, the Adversarially Erased Learning Module (AELM), using two adversarial classifiers. Specifically, we utilize one classifier to identify discriminative regions and erase them to increase the variety of features. Meanwhile, we input the erased feature maps into another classifier to discover new associated body regions, which effectively resists the occlusion of key parts. To easily perform end-to-end training for AELM, we propose a novel Identity model based on Fully Convolutional Networks (IFCN) to directly obtain body response heatmaps during the forward pass by selecting the corresponding class-specific feature maps. Thus, the discriminative regions can be identified and erased in a convenient way. Moreover, to capture discriminative regions for AELM, we present a Complementary Attention Module (CoAM) that combines channel and spatial attention to automatically focus on which feature types and positions are meaningful in the feature maps. In this paper, CoAM and AELM are cascaded into one module which is applied to the outputs of different convolutional layers to mix mid- and high-level semantic features. Experimental results on three challenging benchmarks demonstrate the efficacy of the proposed method.

Keywords—Complementary Attention, Adversarially Erased Learning, Fully Convolutional Networks

I. INTRODUCTION

Given a person-of-interest image, person re-identification (Re-ID) is the task of matching all images of the same identity captured across different security cameras. Despite recent remarkable progress, person Re-ID is still a challenging task due to environmental settings such as background clutter and illumination, as well as human attributes such as posture and clothing changes. These factors can make visual cues dramatically different from image to image. To address such challenges, deep-learning methods have been widely applied and have obtained promising performance compared with hand-crafted techniques [1] [2] [3]. However, deep Re-ID models often contain many parameters which are usually optimized using a limited dataset, which increases the risk of over-fitting and weakens generalization capability. Hence, improving the generalization ability of deep Re-ID models is a significant and important problem.

For deep-learning Re-ID methods, increasing the variety of training data is an effective way to improve the generalization ability of Convolutional Neural Networks (CNNs). Unlike other tasks such as object recognition, Re-ID requires the collection of cross-camera data in order to obtain an adequately large dataset, and this is exceedingly expensive with respect to time. To tackle such issues, data augmentation is an alternative technique which makes the most of the current dataset without increasing collection costs. Recent studies have generated images of persons with different body poses [4] and camera styles [5] using Generative Adversarial Networks (GANs) [6]. However, such approaches suffer from long training times, difficulty in convergence, and low image resolution. In addition to explicitly synthesizing new images, other common methods include randomly cropping, flipping and mirroring the original images.

Occlusion is also a critical factor influencing the generalization of CNNs. One possible way to address it is to collect large-scale datasets containing occluded pedestrian images. However, this again is rather expensive; a more reasonable solution is to simulate occlusion. For instance, [7] selects a rectangular region with random position and size within an input image and erases its pixels with random values. In [8], network visualization techniques are used to search for a discriminative region; this region is then occluded with a rectangle to generate new samples, and these samples are combined with the original ones to re-train the Re-ID model, yet this is not an end-to-end learning process. In addition, both strategies occlude the original images, which is considered image-level augmentation.

In this paper, we aim to increase the variety of features and resist the occlusion of key image regions to improve generalization ability. We first propose a novel Identity model using Fully Convolutional Networks (IFCN) which replaces the fully connected layer with a convolutional layer. With this design, the body response heatmap can be obtained during the forward pass by directly choosing the class-specific feature maps of the newly added convolutional layer, rather than using a post-processing step as in [18]. Then, we introduce an end-to-end feature-level augmentation strategy, the Adversarially Erased Learning Module (AELM). Benefiting from IFCN, we can utilize a main classifier to identify the discriminative body regions, which are used to guide the erasing of feature maps to increase the variety of features. Meanwhile, the erased features are fed into an auxiliary classifier to discover new body regions, which helps to deal with occlusions or scenarios where key body parts are missing. Moreover, as more discriminative body regions are erased, the feature maps fed into the auxiliary classifier become harder to classify, which encourages the network to exploit more generalized features that exist in most images instead of targeting specialized patterns present only in the current image of interest. In order to capture discriminative regions for the AELM as much as possible, we present a Complementary Attention Module (CoAM) combining channel and spatial attention. Since features extracted by convolutional operations are combinations of cross-channel and spatial information, we employ CoAM to emphasize meaningful features along these two axes, on the fly, to capture discriminative regions. Most existing methods employ the output of the final convolutional layer as an identity representation, which mainly contains high-level semantic features and discards mid-level semantic features. Therefore, we apply the cascaded CoAM and AELM to the outputs of different convolutional layers to combine mid- and high-level semantic features. Generally speaking, the novel contributions are three-fold.

● We propose a novel identity model, IFCN, which provides a simple, yet effective, way to obtain body response heatmaps, and mathematically show that it is equivalent to the existing identity model.

● We introduce feature-level augmentation, AELM, in an end-to-end manner, which erases discriminative regions to increase the variety of feature maps, and resists occlusion of key parts.

● We present a CoAM to learn which information should be emphasized or suppressed, and apply CoAM and AELM to different layers to combine both mid- and high-level semantic features.

II. THE PROPOSED METHOD

In this section, we present the structure of the Complementary Attention and Adversarially Erased Learning framework. The backbone of the proposed method is a ResNet-50 [19] network, which has a relatively concise architecture, enables fair performance comparison, and is consistent with previous methods [8] [10] [14]. As shown in Fig. 1, the proposed network contains three branches used for combining mid- and high-level semantic features. We call these branches the Mid_1 Branch, the Mid_2 Branch and the High Branch, from top to bottom. Each branch is a cascade of CoAM and AELM. Within each branch, CoAM is sequentially composed of a Channel Attention Module (CAM) and a Spatial Attention Module (SAM) to automatically emphasize meaningful features, while AELM is utilized to simulate occlusion and discover new body parts. During training, the proposed method minimizes the sum of the identity prediction losses of the three branches. During testing, the feature maps refined by CoAM in the three branches are fed into Global Average Pooling (GAP) layers to obtain 2048-dimensional feature vectors, which are subsequently concatenated to form the final descriptor.

Fig. 1. The architecture of the proposed framework.
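To make the test-time pipeline concrete, the following is a minimal PyTorch sketch of the descriptor construction; the function name and example shapes are illustrative, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def build_descriptor(branch_features):
    """Sketch of the test-time descriptor: each branch's CoAM-refined
    feature map (B, 2048, H, E) is global-average-pooled to a 2048-d
    vector, and the three vectors are concatenated into a 6144-d descriptor."""
    pooled = [F.adaptive_avg_pool2d(f, 1).flatten(1) for f in branch_features]
    return torch.cat(pooled, dim=1)

# Illustrative shapes for the Mid_1, Mid_2 and High branches.
feats = [torch.randn(4, 2048, 24, 8) for _ in range(3)]
descriptor = build_descriptor(feats)   # (4, 6144)
```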

A. Identity Model based on Fully Convolutional Networks

1) Revisiting the existing identity model: Due to its ability to learn a discriminative feature representation, the identity model is widely used in Re-ID research. The standard identity model follows the structure in Fig. 2(a). First, the input image goes through the backbone ResNet-50 network to obtain the feature maps S. Next, S is fed into a GAP layer and a fully connected layer. Finally, a softmax layer is used to obtain the probability of the image belonging to each identity. The average score of the n-th feature map is computed as:

$$s_n = \frac{1}{H \times E} \sum_{a=1}^{H} \sum_{b=1}^{E} S_n(a,b) \qquad (1)$$

where n = 1, 2, ..., N, N is the number of channels, H and E are the height and width of the feature maps respectively, and $S_n(a,b)$ is the (a,b)-th element of the n-th feature map $S_n$. The weight of the fully connected layer is defined as $W \in \mathbb{R}^{N \times C}$, where C is the number of person identities, and the bias term is ignored here for convenience. For the person identity c, the input for the c-th softmax node can be recorded as:

$$y_c = \sum_{n=1}^{N} w_{n,c}\, s_n \qquad (2)$$


where $w_{n,c}$ represents the (n,c)-th element of the weight matrix W. After a forward pass, the heatmap of person identity c can be calculated, as described in [18], by a weighted summation of the feature maps S:

$$M_c(a,b) = \sum_{n=1}^{N} w_{n,c}\, S_n(a,b) \qquad (3)$$

2) Proposed identity model: We have observed two critical issues in the current identity model: the first is that we cannot take advantage of spatial information, due to the fully connected layer used for classification. The second is that, while discriminative regions can be located [18], they are obtained by an additional operation after forward propagation. In this paper, we replace the fully connected layer with a convolutional layer, with reference to semantic segmentation [20] [21] and object localization [22] [23], and introduce an Identity model based on Fully Convolutional Networks (IFCN) into the Re-ID system. With this approach, we can conveniently obtain the body response heatmap by directly choosing the class-specific feature map of the newly added convolutional layer.

Using the proposed identity model in Fig. 2(b), we feed the input image into the backbone network and a convolutional layer with a kernel size of 1×1, stride 1 and C channels to generate a heatmap $\hat{M}_c$ for each identity, c = 1, 2, ..., C. Afterwards, the probability of an identity is obtained through the GAP and softmax layers. If we define the weight of the convolutional layer as $\hat{W} \in \mathbb{R}^{N \times C}$, we can calculate $\hat{M}_c$ as follows:

$$\hat{M}_c(a,b) = \sum_{n=1}^{N} \hat{w}_{n,c}\, S_n(a,b) \qquad (4)$$

where $\hat{w}_{n,c}$ is the (n,c)-th element of the weight matrix $\hat{W}$. At this point, the input value for the c-th node of the softmax layer, $\hat{y}_c$, can be computed as the mean response value of $\hat{M}_c$:

$$\hat{y}_c = \frac{1}{H \times E} \sum_{a=1}^{H} \sum_{b=1}^{E} \hat{M}_c(a,b) \qquad (5)$$

(a) Identity model based on fully connected layer

(b) Identity model based on Fully Convolutional Networks

Fig. 2. Network architecture comparison of two kinds of identity models.

If the parameters of the two identity models are initialized in the same way, $y_c$ and $\hat{y}_c$ are equal due to their identical mathematical forms. Similarly, when the networks converge, the body response heatmaps $M_c$ and $\hat{M}_c$ can be captured with the same precision, as shown in Fig. 2. Compared with an identity model based on a fully connected layer, with IFCN we can directly obtain the heatmap during a forward pass, which helps to identify discriminative regions in a simple way and provides theoretical support for AELM. We illustrate additional heatmaps produced by both kinds of identity models in Fig. 3.

Fig. 3. Heatmaps for the two kinds of identity models.
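For illustration, a minimal PyTorch sketch of an IFCN-style head is given below; the module name and default sizes (e.g., 751 identities, as on Market-1501) are our own choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class IFCNHead(nn.Module):
    """Sketch of the IFCN identity head: a 1x1 convolution replaces the
    fully connected classifier, so class-specific heatmaps are produced
    directly during the forward pass (Eq. (4)) and the identity logits
    are their spatial means (Eq. (5))."""
    def __init__(self, in_channels=2048, num_identities=751):
        super().__init__()
        # One output channel per identity -> one heatmap per identity.
        self.classifier = nn.Conv2d(in_channels, num_identities,
                                    kernel_size=1, stride=1, bias=False)

    def forward(self, feat):                 # feat: (B, N, H, E)
        heatmaps = self.classifier(feat)     # (B, C, H, E), Eq. (4)
        logits = heatmaps.mean(dim=(2, 3))   # GAP over H x E, Eq. (5)
        return logits, heatmaps

# Usage: logits feed the softmax loss; heatmaps locate discriminative
# regions without any post-processing step.
feat = torch.randn(2, 2048, 24, 8)
logits, heatmaps = IFCNHead()(feat)
```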

B. Complementary Attention Module (CoAM)

The goal of this module is to capture discriminative body regions for AELM by using an attention mechanism, which has been shown to boost the performance of Re-ID when used alone. In particular, we present a Complementary Attention Module (CoAM), combining channel and spatial attention, to emphasize important features and suppress unnecessary ones.


1) Channel Attention Module (CAM): For an input image, each channel of the feature maps can be regarded as a feature detector, and channel attention is concerned with which types of features are more meaningful. In this paper, we adopt the idea of the SE block [24] for the proposed Channel Attention Module (CAM). It is applied to the output of ResNet-50 to explore the relationships between channels and is illustrated in Fig. 4.

Fig. 4. Channel Attention Module.

A GAP operation is used to aggregate the spatial information of each channel of the feature maps S to generate the channel feature descriptor $s \in \mathbb{R}^{N}$, and (6) is used to obtain the statistical value of each channel:

$$s_n = \frac{1}{H \times E} \sum_{a=1}^{H} \sum_{b=1}^{E} S_n(a,b) \qquad (6)$$

Then, s is passed through a gating mechanism to obtain the channel attention feature map e:

$$e = \sigma\left(W_2\, \delta(W_1 s)\right) \qquad (7)$$

where $\sigma$ and $\delta$ represent the Sigmoid and ReLU activation functions, respectively, $W_1$ and $W_2$ are the weights of the fully connected layers Fc1 and Fc2, and r is the reduction ratio.

Finally, the refined feature map $S'$ can be calculated by multiplying the channel attention feature map e with the original input feature map S:

$$S' = e \otimes S \qquad (8)$$

where $\otimes$ represents an element-wise multiplication.
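A minimal PyTorch sketch of such a channel-attention block is shown below; the reduction ratio r = 16 is an illustrative default rather than a value reported in the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the CAM / SE-style gating in Eqs. (6)-(8)."""
    def __init__(self, channels=2048, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # Fc1
        self.fc2 = nn.Linear(channels // r, channels)   # Fc2

    def forward(self, feat):                  # feat: (B, N, H, E)
        s = feat.mean(dim=(2, 3))             # GAP, Eq. (6)
        e = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))  # Eq. (7)
        # Broadcast the per-channel gate over the spatial dimensions, Eq. (8).
        return feat * e.unsqueeze(-1).unsqueeze(-1)
```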

2) Spatial Attention Module (SAM): Unlike channel attention, spatial attention focuses on which positions in the feature maps are more informative, and is therefore complementary to channel attention. However, some previous works [17] [25] perform spatial attention through a series of convolutional operations, which only obtain a local view of pedestrian images and ignore the spatial relations between different body parts. To encode spatial context information into pedestrian local features, we align our approach with Non-local Neural Networks [26] and Self-Attention [27]. Next, we elaborate on the process of adaptively aggregating spatial contexts.

Our Spatial Attention Module (SAM) is illustrated in Fig. 5. We feed S into two convolutional layers to extract two new feature maps T and U, and reshape them to $\mathbb{R}^{N \times L}$, where $L = H \times E$ is the number of spatial positions. After transposing T and multiplying it with U, we apply a softmax layer in the row direction to generate a spatial attention map $D \in \mathbb{R}^{L \times L}$. Each element of D can be calculated as:

$$d_{ji} = \frac{\exp\left(T_i^{\top} U_j\right)}{\sum_{i=1}^{L} \exp\left(T_i^{\top} U_j\right)} \qquad (9)$$

where $d_{ji}$ measures the impact of the i-th position on the j-th position. The more similar the feature representations at the two positions are, the higher the relation between them.

Meanwhile, S is sent to a third convolutional layer, which generates a new feature map V that is then reshaped to $\mathbb{R}^{N \times L}$. After that, V is multiplied by the transpose of D via matrix multiplication. By reshaping the result back to $\mathbb{R}^{N \times H \times E}$ and passing it through a final convolutional layer, a feature map X is obtained. Finally, X is multiplied by the scaling parameter $\gamma$, and the input feature map S is added element-wise to obtain the final output F:

$$F = \gamma X + S \qquad (10)$$

The j-th element of F can be expressed as:

$$F_j = \gamma \sum_{i=1}^{L} d_{ji} V_i + S_j \qquad (11)$$

where $\gamma$ is a learnable parameter, initially set to 0 and gradually increased to a greater weight. As can be seen from (11), each feature $F_j$ of the output feature map is a weighted sum of the features at all positions plus the original input feature $S_j$. Hence, it has a global view and can selectively aggregate the spatial context information of related local regions according to the values $d_{ji}$ in the spatial attention map D.


Fig. 5. Spatial Attention Module.
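The position-attention computation in (9)-(10) can be sketched in PyTorch as follows; the layer names and the use of 1×1 kernels are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Sketch of the SAM above (non-local / self-attention style
    position attention over the H x E feature-map positions)."""
    def __init__(self, channels=2048):
        super().__init__()
        self.conv_t = nn.Conv2d(channels, channels, 1)   # produces T
        self.conv_u = nn.Conv2d(channels, channels, 1)   # produces U
        self.conv_v = nn.Conv2d(channels, channels, 1)   # produces V
        self.conv_x = nn.Conv2d(channels, channels, 1)   # produces X
        self.gamma = nn.Parameter(torch.zeros(1))        # starts at 0

    def forward(self, s):                      # s: (B, N, H, E)
        b, n, h, e = s.shape
        L = h * e
        t = self.conv_t(s).view(b, n, L)
        u = self.conv_u(s).view(b, n, L)
        # Attention map D: row j weights the contribution of every position i, Eq. (9).
        d = F.softmax(torch.bmm(t.transpose(1, 2), u), dim=-1)   # (B, L, L)
        v = self.conv_v(s).view(b, n, L)
        x = torch.bmm(v, d.transpose(1, 2)).view(b, n, h, e)     # aggregate contexts
        x = self.conv_x(x)
        return self.gamma * x + s               # Eq. (10)
```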

3) Combining CAM and SAM: The two attention modules, CAM and SAM, produce complementary attention features by focusing on the type and position of valid features, respectively. Thus, they can be combined to take advantage of both. As shown in Fig. 1, we arrange CAM and SAM sequentially:

$$F = f_{\mathrm{SAM}}\left(f_{\mathrm{CAM}}(S)\right) \qquad (12)$$

Obviously, there are two other possible arrangements: SAM-first order and a parallel arrangement of CAM and SAM. We have found that the CAM-first order is superior to the other arrangements, as detailed in the experimental results section.

C. Adversarially Erased Learning Module (AELM)

Occlusion is a critical factor influencing the generalization of CNNs. When some parts of a pedestrian image are occluded, a robust Re-ID model should still be able to classify it correctly. In addition, deep-learning person Re-ID models usually distinguish one pedestrian from others by a unique pattern, so the heatmap tends to highlight only a small part of the body rather than the whole body. In contrast, our AELM resists occlusions and discovers relatively complete body regions in an adversarial manner.

As shown in Fig. 6, the AELM network consists of three components: the feature maps F refined by CoAM, the main classifier A, and the auxiliary classifier B. F is sent to the two parallel classifiers, and the heatmap produced by each classifier is obtained via IFCN. Both branches contain the same number of convolutional layers, a GAP layer, and a softmax layer for classification.

Fig. 6. Adversarially Erased Learning Module.

However, the input feature maps of the two classifiers are different. Under the guidance of the discriminative region mined by the main classifier A, the input of the auxiliary classifier B is partially erased to increase the variety of the features. Specifically, the region of the heatmap produced by classifier A whose response is above a given erasing threshold is considered a discriminative region. The corresponding region in the input feature maps of the auxiliary classifier B is then erased in an adversarial manner by replacing the response values with zeros. Such an operation gives classifier B an incomplete feature map, so that it learns occlusion cases and discovers new regions.
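The erasing step might look like the following PyTorch sketch; the per-image normalisation of the heatmap is our assumption, and 0.75 is the erasing threshold reported in Section III.B.

```python
import torch

def erase_by_heatmap(features, heatmap_a, threshold=0.75):
    """Sketch of the AELM erasing step: zero out, in the auxiliary
    classifier's input, the positions that classifier A's heatmap marks
    as discriminative (response above the erasing threshold).
    `heatmap_a` is assumed to be the heatmap of the ground-truth identity."""
    b = heatmap_a.size(0)
    flat = heatmap_a.view(b, -1)
    # Normalise each heatmap to [0, 1] so the threshold is comparable (assumption).
    norm = (flat - flat.min(dim=1, keepdim=True).values) / \
           (flat.max(dim=1, keepdim=True).values -
            flat.min(dim=1, keepdim=True).values + 1e-6)
    mask = (norm < threshold).float().view(b, 1, *heatmap_a.shape[-2:])
    # Classifier B receives an incomplete feature map and must find new regions.
    return features * mask
```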

While our method is inspired by [23], our motivation is different. In [23], two localization maps are merged using element-wise summation, so the output is a localization map. In this paper, the erasing operation on the feature maps increases the variety of features and simulates missing key parts, which improves generalization ability. Meanwhile, because the discriminative regions are erased, the network is driven to discover new part regions during the training phase, which improves the discriminability of the feature representations during the testing phase.

D. Multi-level Semantic Branches

Most existing person Re-ID systems directly use a Deep Neural Network (DNN) designed for object categorization and employ the final-layer output, with high-level semantic features, as a representation. As a result, mid-level features are missed and cannot help to effectively distinguish an identity. Considering this, we also apply CoAM and AELM to the outputs of res5a and res5b to exploit mid-level semantic features. In [28], the mid-layer and final-layer feature maps are fused as a representation, followed by a softmax function to predict a person's identity. Compared with [28], each branch in our model is supervised by an independent softmax loss during the training phase, and the features from the three branches are then concatenated together to form the final descriptor.


E. Loss function

With the aim of improving robustness to a variety of multi-class classification tasks, the softmax loss is used to unleash the discriminative capability of the deep representation. The loss function is the sum over the three branches used to predict the identity:

$$L = -\sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{k=1}^{K} \lambda_{k} \log \frac{\exp\left(y^{\,p,m,k}_{c_{p}}\right)}{\sum_{c=1}^{C} \exp\left(y^{\,p,m,k}_{c}\right)} \qquad (13)$$

where P is the size of the mini-batch; C, M and K are the number of person identities, branches and classifiers, respectively; $y^{\,p,m,k}_{c}$ is the input to the c-th softmax node of the k-th classifier of the m-th branch for the p-th example; and $c_{p}$ is the label of the p-th example. $\lambda_{k}$ controls the relative importance of the two classifiers: $\lambda_{k}$ is 1 for the main classifier and 0.5 for the auxiliary classifier.
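A minimal PyTorch sketch of this summed, weighted softmax loss is given below; the nesting of `logits` as branches × classifiers is an assumed data layout, not the authors' code.

```python
import torch.nn.functional as F

def aelm_loss(logits, labels, lambda_aux=0.5):
    """Sketch of Eq. (13): cross-entropy summed over M branches and
    K = 2 classifiers per branch, with the auxiliary classifier
    down-weighted. `logits[m][k]` is a (P, C) tensor of softmax inputs."""
    total = 0.0
    for branch_logits in logits:                         # M branches
        for k, cls_logits in enumerate(branch_logits):   # K classifiers
            weight = 1.0 if k == 0 else lambda_aux       # main vs auxiliary
            total = total + weight * F.cross_entropy(
                cls_logits, labels, reduction='sum')     # sum over the mini-batch
    return total
```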

III. EXPERIMENTS

A. Datasets and Evaluation Protocols

1) Datasets: We evaluate the proposed method on three mainstream datasets: Market-1501 [29], DukeMTMC-reID [30] and CUHK03 [31]. The Market-1501 dataset consists of 1,501 identities captured from 6 camera viewpoints, with 12,936 training images and 19,732 gallery images. The DukeMTMC-reID dataset contains 36,411 pedestrian images of 1,404 identities captured by 8 manually installed cameras; it is divided into a training set with 16,522 images and a testing set with 2,228 query images and 17,661 gallery images. CUHK03 includes 1,467 labelled identities with a total of 14,097 images collected on the Chinese University of Hong Kong (CUHK) campus. This dataset includes both hand-labelled bounding boxes and automatic detections by DPM [32], and the evaluation in this paper is based on this dataset.

2) Evaluation Protocols: We report the Cumulative Matching Characteristics (CMC) and mean Average Precision (mAP) for all candidate datasets. All evaluations are performed in single-query mode. We adopt the evaluation packages provided in [29] and [30] for the Market-1501 and DukeMTMC-reID datasets, respectively. For CUHK03, we use the new protocol provided by [33].

B. Implementation details

The proposed model is trained and fine-tuned on the PyTorch platform. For the backbone network, we adopt the ResNet-50 model with weights pre-trained on ImageNet [34]. There are some slight modifications to the original ResNet-50: the global average pooling layer and everything after it are removed; the stride of the res_conv4a block is set to 1; and two mid-level semantic branches are added directly after res_conv5a and res_conv5b. During training, we employ only horizontal flipping of the pedestrian images for data augmentation. In order to obtain feature maps of a suitable size that contain detailed information, all the training images are resized to the same fixed resolution. The mini-batch size is set to 64 for all experiments and we train the model for 60 epochs in total. The erasing threshold is set to 0.75. With respect to the learning rate strategy, we set the base learning rate to 0.1 and decay it to 0.07 and 0.01 after 15 and 30 epochs, respectively. Note that the learning rate for all the layers from the pre-trained ResNet network is set to 0.1 times the base learning rate. As for the optimizer, we choose Stochastic Gradient Descent (SGD) with momentum 0.9 and weight decay 0.0005 to update the parameters.
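The optimisation settings above can be sketched as follows; the `backbone` parameter-name prefix and the scheduler construction are our assumptions.

```python
from torch import optim

def make_optimizer_and_scheduler(model, base_lr=0.1):
    """Sketch of the reported settings: SGD (momentum 0.9, weight decay
    5e-4), pre-trained layers at 0.1x the base learning rate, and decay
    from 0.1 to 0.07 and 0.01 at epochs 15 and 30."""
    pretrained, new = [], []
    for name, p in model.named_parameters():
        # 'backbone' is a placeholder prefix for the pre-trained ResNet layers.
        (pretrained if name.startswith('backbone') else new).append(p)
    optimizer = optim.SGD(
        [{'params': pretrained, 'lr': 0.1 * base_lr},
         {'params': new, 'lr': base_lr}],
        momentum=0.9, weight_decay=5e-4)

    def lr_lambda(epoch):
        # Multiplicative factors reproduce 0.1 -> 0.07 -> 0.01 for the base group.
        if epoch < 15:
            return 1.0
        return 0.07 / base_lr if epoch < 30 else 0.01 / base_lr

    scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```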

C. Comparison with State-of-the-Art Methods

1) Evaluation on the Market-1501 dataset: The results on the Market-1501 dataset are shown in TABLE I. Concretely, the proposed method achieves a mAP of 75.6% and a Rank1 accuracy of 89.3%. The compared methods are divided into three categories: hand-crafted methods, deep-learning methods based on global or local features, and deep-learning methods based on data augmentation. Deep-learned features perform significantly better than hand-crafted features. The local feature representation method, DPFL [35], has the closest performance to the proposed method. Noting that DPFL adopts multi-scale images as input, we believe that with such a design the performance of the proposed model could be further enhanced.

TABLE I. COMPARISON RESULTS ON THE MARKET-1501 DATASET

Method             Rank1  Rank5  Rank10  mAP
BoW+Kissme [4]     44.4   63.9   72.2    20.8
WARCA [2]          45.2   68.1   76.0    -
SVDNet [10]        82.3   92.3   95.2    62.1
Triplet Loss [9]   84.9   94.2   -       69.1
ACNN [13]          85.9   -      -       66.9
PSE [12]           87.7   94.5   96.8    69.0
DPFL [35]          88.6   -      -       72.6
IDE+RE [7]         85.2   -      -       68.3
AOS [8]            86.5   -      -       70.4
Proposed Method    89.3   95.7   97.2    75.6

Among the data augmentation strategies, AOS [8] is the most similar to the proposed method. Nevertheless, there are three main disadvantages of AOS: a) due to fully connected layers being used for classification, an extra step is needed to obtain a body response heatmap; b) it first occludes discriminative regions on the original images to produce new samples and then combines the new and original samples to re-train the Re-ID model, which is not an end-to-end learning process; c) when searching for body regions to occlude, AOS utilizes multiple classifiers to monitor the variation of classification probability and has to set several hyper-parameters, such as the spatial size H and E of the rectangle and the occlusion probability threshold. In this paper, we propose IFCN to directly obtain heatmaps during forward passes. Owing to IFCN, the proposed feature augmentation, AELM, can be trained in an end-to-end manner and only uses two classifiers and a single hyper-parameter, the erasing threshold.

2) Evaluation on DukeMTMC-reID: For the DukeMTMC-reID dataset, we compare the proposed method with several state-of-the-art methods. As presented in TABLE II, the proposed method achieves 81.2% Rank1 accuracy and 65.9% mAP. PSE [12] is similar to the proposed method, but it makes use of external human pose information, and again is not an end-to-end method. At Rank1 accuracy, the proposed method outperforms PSE [12], AOS [8] and DPFL [35] by 1.4%, 2% and 2%, respectively. Additionally, the proposed approach surpasses these state-of-the-art methods with respect to mAP by a large margin (+3.9%, +3.8%, +5.3%). In fact, Rank1 accuracy indicates the ability to match the easiest gallery image across cameras, while mAP characterizes the ability to retrieve all the gallery images, which is a more important evaluation metric.

TABLE II. COMPARISON RESULTS ON THE DUKEMTMC-REID DATASET

Method             Rank1  mAP
BoW+Kissme [4]     25.1   12.2
LOMO+XQDA [1]      30.8   17.0
IDE+RE [7]         74.2   56.2
SVDNet [10]        76.7   56.8
ACNN [13]          76.8   59.3
DPFL [35]          79.2   60.6
AOS [8]            79.2   62.1
PSE [12]           79.8   62.0
Proposed Method    81.2   65.9

3) Evaluation on the CUHK03 dataset: The comparison results using the CUHK03 dataset are reported in TABLE III. In general, deep-learning models tend to have compromised performance on small datasets, whereas the proposed model still performs well. Concretely, the mAP of the proposed approach is 8.0% and 12.7% higher than that of AOS [8] and HA-CNN [25], respectively. We attribute this positive result to the effectiveness of the proposed feature augmentation strategy, AELM, which increases the variety of feature maps and thus reduces the risk of over-fitting, especially on smaller-scale datasets such as this one.

TABLE III. COMPARISON RESULTS ON THE CUHK03 DATASET

Method             Rank1  mAP
BoW+XQDA [4]       6.4    6.4
LOMO+XQDA [1]      12.8   11.5
DPFL [35]          40.7   37.0
IDE+RE [7]         41.5   36.8
SVDNet [10]        41.5   37.3
HA-CNN [25]        41.7   38.6
AOS [8]            47.1   43.3
Proposed Method    52.6   51.3

D. Ablation Study

To verify the validity of each individual component of the proposed method, we conduct several ablation studies using the Market-1501 dataset.

1) Effectiveness of attention modules and their arrangements: We add two convolutional layers with 1024 channels and stride 1 on top of the backbone, as described in Fig. 2(b), and adopt this structure as the Baseline method. Note that the baseline model only contains high-level semantic information. The results of applying CAM or SAM independently, and of combining the two modules in different arrangements, are shown in TABLE IV. When CAM or SAM is employed independently, the performance of Re-ID shows a promising improvement. Therefore, it is natural to integrate the two attention modules. Among the three different ways of arranging CAM and SAM, we observe that the CAM-first order is better than both the parallel arrangement and the SAM-first order.

TABLE IV. COMPARISON RESULTS OF DIFFERENT ATTENTION MODULES

Model               Rank1  Rank5  Rank10  mAP
Baseline            84.9   93.9   95.8    66.8
Baseline+SAM        85.9   94.5   96.4    69.5
Baseline+CAM        87.1   94.8   96.9    70.8
Baseline+SAM+CAM    86.9   94.4   95.9    67.8
Baseline+SAM&CAM    85.5   94.2   96.0    69.3
Baseline+CAM+SAM    88.2   95.1   96.8    71.6

2) Effectiveness of the Adversarially Erased Learning Module: The experimental result of using AELM with the CoAM of the High Branch is given in TABLE V. AELM enhances local features by discovering new human body regions, and increases Rank1 and mAP by 0.6% and 2.1%, respectively. Fig. 7 shows that the two classifiers mine discriminative regions from different poses of different pedestrians. Because its input is partially erased, classifier B is driven to discover different, but complementary, body regions to classifier A. For instance, in the first two columns of Fig. 7, classifier A focuses on the hip region while classifier B pays attention to the upper body region. In real scenes, the position and extent of the occluded region of a pedestrian are uncertain. Thus, we generate a rectangular box of arbitrary size at an arbitrary position of the image to simulate examples of occlusion, as sketched after Fig. 8. From Fig. 8, we can observe that when the upper or lower body of the pedestrian is occluded, the CoAM model fails to match the correct gallery images, which is undesirable. After adding AELM, the proposed model resists occlusion more robustly.


Fig. 7. Heatmaps for two classifiers of AELM.

Fig. 8. Rank lists for occluded queries before and after using AELM.
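The occlusion simulation described above might be implemented as in the following sketch; the size bounds and fill colour are our assumptions.

```python
import random
from PIL import Image, ImageDraw

def occlude_random_rect(img: Image.Image, fill=(0, 0, 0)) -> Image.Image:
    """Sketch of the occlusion simulation used for Fig. 8: paint a
    rectangle of arbitrary size at an arbitrary position on the query image."""
    w, h = img.size
    rw, rh = random.randint(w // 8, w // 2), random.randint(h // 8, h // 2)
    x, y = random.randint(0, w - rw), random.randint(0, h - rh)
    out = img.copy()
    ImageDraw.Draw(out).rectangle([x, y, x + rw, y + rh], fill=fill)
    return out
```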

TABLE V. COMPARISON RESULTS OF AELM AND DIFFERENT NUMBERS OF BRANCHES

Model                 Rank1  Rank5  Rank10  mAP
CoAM                  88.2   95.1   96.8    71.6
CoAM+AELM (High)      88.8   95.5   97.1    73.7
High+Mid_2            89.0   95.7   97.2    75.2
High+Mid_2+Mid_1      89.3   95.7   97.2    75.6

3) Effectiveness of different numbers of branches: In Fig. 1, we add two mid-level semantic branches immediately after res_conv5a (Mid_1) and res_conv5b (Mid_2), which are also cascaded by CoAM and AELM in the same way. From TABLE V, we can observe that as the number of branches increases, the performance of Re-ID shows a consistent improvement. Hence, it is essential to mix mid- and high-level semantic features for deep person Re-ID. We also conduct ablation studies on the DukeMTMC-reID and CUHK03 datasets and obtain the same conclusion.

IV. CONCLUSION

This paper presents a deep-learning part-based method, cascading CoAM and AELM, to improve the discriminability and generalization of feature representations. In contrast to existing identity models, which obtain body response heatmaps by an extra step, we introduce an FCN into person Re-ID and directly obtain the heatmaps during forward propagation. Based on this, we propose a feature augmentation strategy, AELM, trained end-to-end. The two adversarial classifiers increase the variety of feature maps, simulate occlusion of key parts, and mine different, but complementary, local regions. We also propose a CoAM to refine feature maps along the channel and spatial axes for better localization of body regions. Furthermore, we apply CoAM and AELM at different layers to combine mid- and high-level semantic features. Experiments are conducted using three challenging datasets, and the significant improvements demonstrate the effectiveness of the proposed method, especially on small-scale datasets such as CUHK03.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (No. 61471110, 61733003), the National Key R&D Program of China (No. 2017YFC0805000/5005), and the Fundamental Research Funds for the Central Universities (N172608005, N160413002).

REFERENCES

[1] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, “Person re-identification by local maximal occurrence representation and metric learning,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2197-2206.

[2] C. Jose and F. Fleuret, “Scalable metric learning via weighted approximate rank component analysis,” in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 875-890.

[3] L. Zhang, T. Xiang, and S. Gong, “Learning a discriminative null space for person re-identification,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1239-1248.

[4] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis,” arXiv preprint arXiv:1704.04086, 2017.

[5] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, “Camera style adaptation for person re-identification,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5157-5166.

[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[7] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” arXiv preprint arXiv:1708.04896, 2017.

[8] H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang, “Adversarially Occluded Samples for Person Re-Identification,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5098-5107.

[9] A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017.

[10] Y. Sun, L. Zheng, W. Deng, and S. Wang, “Svdnet for pedestrian retrieval,” in the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3820-3828.

[11] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian, “Pose driven deep convolutional model for person re-identification,” in the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3980–3989.

[12] S. M. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, “A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking,” in the proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 2018.


[13] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang, “Attention-aware compositional network for person re-identification,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[14] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Deep metric learning for person re-identification,” in 22nd International Conference on Pattern Recognition (ICPR), 2014, pp. 34-39.

[15] D. Li, X. Chen, Z. Zhang, and K. Huang, “Learning deep context-aware features over body and latent parts for person re-identification,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 384–393.

[16] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang, “Hydraplus-net: Attentive deep features for pedestrian analysis,” in the Proceedings of IEEE International Conference on Computer Vision, Oct 2017.

[17] L. Zhao, X. Li, Y. Zhuang, and J. Wang, “Deeply-learned part-aligned representations for person re-identification,” in the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3239–3248.

[18] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2921-2929.

[19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.

[20] J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no.3, pp. 640-651, 2014.

[21] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang, “Revisiting dilated convolution: A simple approach for weakly- and semi-supervised semantic segmentation,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7268-7277.

[22] X. Zhang, Y. Wei, G. Kang, Y. Yang, and T. Huang, “Self-produced guidance for weakly-supervised object localization,” in European Conference on Computer Vision (ECCV). Springer, Sep 2018.

[23] X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang, “Adversarial complementary learning for weakly supervised object localization,” in the proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 2018.

[24] J. Hu, L. Shen, and G. Sun, “Squeeze-and-Excitation Networks,” arXiv preprint arXiv:1709.01507, 2017.

[25] W. Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification,” in the proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 2018.

[26] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in the proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 2018.

[27] J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” arXiv preprint arXiv:1809.02983, 2018.

[28] Q. Yu, X. Chang, Y. Z. Song, T. Xiang, and T. M. Hospedales, “The devil is in the middle: exploiting mid-level representations for cross-domain instance matching,” arXiv preprint arXiv:1711.08106, 2017.

[29] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1116–1124.

[30] Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by gan improve the person re-identification baseline in vitro,” in the proceedings of IEEE International Conference on Computer Vision, Oct 2017.

[31] W. Li, R. Zhao, T. Xiao, and X. Wang, “DeepReID: Deep filter pairing neural network for person re-identification,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 152–159.

[32] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1-8.

[33] Z. Zhong , L. Zheng , D. Cao , and S. Li, “Re-ranking person re-identification with k-reciprocal encoding,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3652-3661.

[34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. “Imagenet: A large-scale hierarchical image database,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.

[35] Y. Chen, X. Zhu and S. Gong, “Person re-identification by deep learning multi-scale representations,” in the IEEE International Conference on Computer Vision Workshop, 2017.