[IEEE 2014 IEEE International Conference on Multimedia and Expo (ICME) - Chengdu, China...

BAGGING BASED METRIC LEARNING FOR PERSON RE-IDENTIFICATION

Bohuai Yao*, Zhicheng Zhao*†, Kai Liu*, Anni Cai*†

*School of Information and Communication Engineering
†Beijing Key Laboratory of Network System and Network Culture

Beijing University of Posts and Telecommunications, Beijing, China
{ybohuai, 1997.liukai}@gmail.com, {zhaozc, annicai}@bupt.edu.cn

ABSTRACT

Person re-identification is a challenging problem in computer vision due to large variations of appearance among different cameras. Recently, metric learning has been widely used to model the transformation between cameras. However, traditional metric learning based methods learn only one metric for the whole feature space, which cannot model different kinds of appearance variations well. In this paper, we introduce bagging into metric learning and propose a bagging-based large margin nearest neighbor (LMNN) method for person re-identification. That is, multiple LMNN predictors are generated on sub-regions of the feature space and combined into an aggregated predictor to improve performance. Two bagging strategies, sample-bagging and feature-bagging, are proposed and compared. Extensive experiments on three benchmarks demonstrate the superiority of the proposed approach over state-of-the-art methods.

Index Terms— Person re-identification, sample-bagging,

feature-bagging, LMNN

1. INTRODUCTION

With the rapid development of security systems and the progress of computer vision techniques, person re-identification (PRID), which aims to recognize an individual from images observed over a video surveillance network, has attracted increasing attention from researchers in recent years. Because clothing appearance features are easy to extract and distinctive, clothing appearance is widely used in PRID for pedestrian representation. However, the large variations of appearance caused by changes in pedestrian pose, camera viewpoint and lighting conditions between different cameras often make images of different persons appear more similar than images of the same person. As a result, PRID remains a challenging task.

This work is supported by Chinese National Natural Science Foundation (90920001, 61101212, 61372169), National High Technology R&D Program of China (863 Program) (No. 2012AA012505, 2012AA012504), National Key Technology R&D Program (2012BAH63F00, 2012BAH41F03), and the Fundamental Research Funds for the Central Universities.

Existing works try to solve this problem in two ways: (1) Seeking distinctive and stable feature representations of an individual's appearance, such as the color histogram [1], the principal axis histogram [2] and the rectangular region histogram [3]. Some approaches try to find a global weighting that aggregates a number of features to improve performance [4]; (2) Learning a distance metric, or projecting features from different views into a common space for matching, in order to suppress inter-camera variations. For instance, LMNN-R [5] improves classification accuracy by finding large margin nearest neighbors with rejection. PRDC [6] maximizes the probability that a truly matched pair has a smaller distance than a mismatched pair.

In practice, however, large variations in view angle, lighting, background clutter and occlusion encountered in PRID make the changes of appearance between different cameras complex. One distance metric usually cannot model all kinds of appearance variations in the whole image space. Therefore, learning a local metric for each configuration of pedestrian images becomes an effective way to attack this problem [7]. Based on this idea, we propose two methods in this paper that partition the whole dataset into a number of subsets and use each subset to train a local transform matrix with LMNN. The distances under all transform matrices are then aggregated to form a final distance metric, which is used for matching. We introduce bagging [8] into our subset selection strategies (in a form somewhat different from traditional bagging). To the best of our knowledge, bagging has not been applied to PRID before. The two proposed methods are named sample-bagging based LMNN and feature-bagging based LMNN, according to whether the data partition is performed in the sample space or in the feature space. Bagging methods improve the stability and accuracy of machine learning algorithms, and they also help to avoid overfitting. With the benefits of bagging and local metric learning, our proposed methods achieve good performance. In particular, the feature-bagging method gives a further improvement, since the subsets in feature-bagging are usually less correlated than those in sample-bagging.

In addition, the inter-class appearance difference in PRID can sometimes be quite small. For example, different pedestrians may wear the same or similar kinds of clothes (Fig. 1(a)), and color distortion caused by lighting or camera settings may make different pedestrians' clothes look the same (Fig. 1(b)). In such circumstances, keeping the subtle information in features is critical for matching individual identities. It is well known that the computational expense of metric learning grows rapidly with the training set size and the feature dimension. Thus dimension reduction is commonly employed when metric learning is adopted in PRID, and in consequence the subtle information may be lost. In our bagging-based methods, however, the size of each subset is much smaller than that of the original data/feature set. Therefore, our sample-bagging based method can reduce the size of the problem for a large dataset, while the feature-bagging based method can do so for high-dimensional features without the information loss caused by dimension reduction. In addition, because the small subsets are selected randomly and independently, our methods can learn the multiple metrics on the subsets in parallel.

Fig. 1: (a) Different persons with similar appearance. (b) Color distortion makes different persons similar.

Our bagging-based methods are described in Section 2, and are experimentally evaluated in Section 3 on three benchmark data sets for person re-identification.

2. THE APPROACH

In this section, we first briefly describe the LMNN method, and then explain our bagging-based LMNN methods in detail.

2.1. LMNN

LMNN [9] learns a Mahalanobis distance metric that encourages the k nearest neighbors of each sample to belong to the same class, while samples from different classes are separated by a large margin.

Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a training set of N samples with their class labels. Let $y_{ij} \in \{0, 1\}$ indicate whether $y_i$ and $y_j$ match, and let $\eta_{ij} \in \{0, 1\}$ indicate whether $x_j$ is a target neighbor of $x_i$. The goal of LMNN is to learn a linear transformation $L : \mathbb{R}^d \to \mathbb{R}^d$, which is used to compute the squared distance

$$D_L(x_i, x_j) = \|L(x_i - x_j)\|^2. \tag{1}$$

The squared distance can be rewritten as

$$D_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), \tag{2}$$

where the Mahalanobis distance metric M is induced by the linear transformation L as $M = L^T L$. Thus $D_M$ is a valid squared distance, because M is a symmetric positive-semidefinite matrix.

On the one hand, we minimize the distance between each training point and its k nearest similarly labeled neighbors by minimizing $\varepsilon_{pull}$:

$$\varepsilon_{pull}(M) = \sum_{i,j}^{N} \eta_{ij} D_M(x_i, x_j). \tag{3}$$

On the other hand, we push away all differently labeled points that are closer than the aforementioned k nearest neighbors' distances plus a constant margin by minimizing $\varepsilon_{push}$:

$$\varepsilon_{push}(M) = \sum_{i,j}^{N} \sum_{l=1}^{N} \eta_{ij} (1 - y_{il}) \left[ 1 + D_M(x_i, x_j) - D_M(x_i, x_l) \right]_+. \tag{4}$$

The overall cost $\varepsilon_{LMNN}$ is defined as a weighted combination of $\varepsilon_{pull}$ and $\varepsilon_{push}$:

$$\varepsilon_{LMNN}(M) = (1 - \mu)\,\varepsilon_{pull}(M) + \mu\,\varepsilon_{push}(M), \tag{5}$$

where $\mu$ is a tuning parameter and $[z]_+ = \max(z, 0)$ denotes the standard hinge loss. The cost function consists of two terms: the first term penalizes large distances between each training point and its target neighbors, while the second term penalizes small distances between each training point and its impostors (i.e., differently labeled training points that are nearer than the target neighbors plus the margin).
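To make Eqs. (1)-(5) concrete, the following is a minimal brute-force sketch that evaluates the LMNN cost for a given transformation L. It is not the solver of [9]: the function name `lmnn_cost`, the defaults `k=1` and `mu=0.5`, and the choice of target neighbors as the k nearest same-class points under the Euclidean distance are assumptions made for illustration only.

```python
import numpy as np

def lmnn_cost(X, y, L, k=1, mu=0.5):
    """Evaluate the LMNN cost of Eq. (5) for M = L^T L (brute force)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(X)

    # Target neighbors eta_ij: assumed here to be the k nearest same-class
    # points of x_i under the plain Euclidean distance.
    diff = X[:, None, :] - X[None, :, :]
    d_euc = np.sum(diff ** 2, axis=2)
    eta = np.zeros((n, n), dtype=bool)
    for i in range(n):
        same = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
        nearest = same[np.argsort(d_euc[i, same])[:k]]
        eta[i, nearest] = True

    # Squared distances under the metric, Eq. (1): ||L (x_i - x_j)||^2.
    proj = X @ np.asarray(L, dtype=float).T
    pdiff = proj[:, None, :] - proj[None, :, :]
    D = np.sum(pdiff ** 2, axis=2)

    # Pull term, Eq. (3): distances to target neighbors.
    pull = D[eta].sum()

    # Push term, Eq. (4): hinge loss over differently labeled points x_l.
    push = 0.0
    for i, j in zip(*np.nonzero(eta)):
        different = y != y[i]  # selects the l with (1 - y_il) = 1
        push += np.maximum(0.0, 1.0 + D[i, j] - D[i, different]).sum()

    return (1.0 - mu) * pull + mu * push, pull, push
```

Calling `lmnn_cost(X, y, np.eye(X.shape[1]))` gives the cost of the untransformed Euclidean metric, which is a convenient reference value when checking whether a learned L actually lowers the objective.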

2.2. Sample-Bagging based LMNN

Bagging was introduced by Breiman [8]. It is an approach that can give substantial gains in accuracy and stability by generating multiple versions of a predictor and combining them into an aggregated predictor.

In our proposed sample-bagging based LMNN method, let $Q = \{(x_i, y_i)\}_{i=1}^{N}$ be the training set, where $x_i$ is the feature vector of sample i and $y_i$ is its class label. In order to model the various kinds of appearance variations between cameras, we randomly draw several subsets $R_g$, $g = 1, ..., G$, from the whole dataset, and learn one LMNN model to cope with the variations in each subset. The sample selection strategy in our sample-bagging based LMNN method differs from traditional bagging in two respects. Firstly, the size of $R_g$ is n, where $n < N$, because we want to apply localized learning on small subsets to tackle the complex distributions over the whole dataset. Secondly, $R_g$ is extracted at random from Q without replacement, which means that there are no repeated samples within one subset. All subsets are selected independently from the complete training dataset Q, so they may partially overlap with each other.

Algorithm 1 Sample-bagging based LMNN
Input: $Q = \{(x_i, y_i)\}_{i=1}^{N}$: labeled training dataset.
Output: $\{L_g\}_{g=1}^{G}$: linear transformations; $\{w_g\}_{g=1}^{G}$: weights.
1: Feature dimension reduction by PCA;
2: Feature normalization: $Q = \|Q\|_1$;
3: for g = 1 : G do
4:   Randomly select n samples $(x_j, y_j)$ to form subset $R_g$, where $n < N$;
5:   $L_g = \mathrm{LMNN}(R_g)$;
6:   $T_g = Q - R_g$;
7:   $w_g = MR_{rank-m}(T_g, M_g)$;
8: end for
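The subset construction described above (each $R_g$ drawn without replacement, independently across g, with the complement kept for validation) can be sketched as follows. The function name `draw_sample_subsets` and the `seed` argument are hypothetical and only serve the illustration.

```python
import numpy as np

def draw_sample_subsets(N, n, G, seed=0):
    """Return G pairs (R_g indices, T_g indices) over a dataset of N samples."""
    if not 0 < n < N:
        raise ValueError("the subset size must satisfy 0 < n < N")
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(G):
        # R_g: n distinct samples, so no sample repeats inside one subset.
        train_idx = rng.choice(N, size=n, replace=False)
        # T_g = Q - R_g: the complement, later used to estimate the weight w_g.
        val_idx = np.setdiff1d(np.arange(N), train_idx)
        subsets.append((train_idx, val_idx))
    return subsets
```

Because each draw is independent, two subsets may share samples, which matches the partial overlap noted in the text; only repetition inside a single subset is excluded.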

Because the dimension of the feature vector is very high, we reduce it with PCA and normalize the feature vector before training the models. Since the subsets $R_g$ are independent, one metric $M_g$ can be learned by minimizing the LMNN cost function, i.e. Eq. (5), on subset $R_g$ only. The distance between two samples $x_i$ and $x_j$ under this metric is denoted as:

$$D_{M_g}(x_i, x_j) = (x_i - x_j)^T M_g (x_i - x_j). \tag{6}$$

To aggregate all G metrics, we define the final distance between $x_i$ and $x_j$ as follows:

$$D_F(x_i, x_j) = \sum_{g=1}^{G} w_g D_{M_g}(x_i, x_j), \tag{7}$$

where $w_g$ is a weighting factor. Since the subsets $R_g$, $g = 1, ..., G$, are randomly extracted from Q, they may not represent the characteristics of the original dataset equally well. The weighting factor reflects the confidence in distance $D_{M_g}$, which is learned from subset $R_g$. We perform matching with $D_{M_g}$ on a validation set, which is the complement of $R_g$, $T_g = Q - R_g$. The weighting factor is then given by:

$$w_g = MR_{rank-m}(T_g, M_g), \tag{8}$$

where $MR_{rank-m}(\cdot)$ denotes the matching rate at rank m on $T_g$.
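As an illustration of Eqs. (6)-(8), the sketch below computes pairwise distances under a metric, the matching rate at rank m from a query-by-gallery distance matrix, and the weighted aggregate $D_F$. All function names are hypothetical, and the split of the validation set $T_g$ into query and gallery images is an assumption, since the text does not specify how matching on $T_g$ is organized.

```python
import numpy as np

def sq_mahalanobis(A, B, M):
    """Pairwise (a - b)^T M (a - b) between rows of A and rows of B, Eq. (6)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.einsum('ijk,kl,ijl->ij', diff, M, diff)

def matching_rate_at_rank(dist, query_ids, gallery_ids, m):
    """Fraction of queries whose true identity is among the m closest gallery items."""
    order = np.argsort(dist, axis=1)[:, :m]
    hits = (gallery_ids[order] == query_ids[:, None]).any(axis=1)
    return hits.mean()

def aggregated_distance(A, B, metrics, weights):
    """Final distance D_F of Eq. (7): weighted sum of the per-subset distances."""
    return sum(w * sq_mahalanobis(A, B, M) for w, M in zip(weights, metrics))
```

Under these assumptions, each weight would be obtained as `matching_rate_at_rank(sq_mahalanobis(queries, gallery, M_g), query_ids, gallery_ids, m)` on the held-out set $T_g$, so that metrics which match better on unseen samples contribute more to $D_F$.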

It is worth noting that, because the metrics are trained independently of each other, the proposed method can be implemented efficiently with parallel algorithms.

A graphical illustration of the proposed sample-bagging based LMNN method is given in Fig. 2. In the figure, the red lines indicate the processes in training, the blue lines indicate those in testing, and the green dotted lines indicate those in weight learning.

Fig. 2: Graphical illustration of the proposed sample-bagging based LMNN method.

2.3. Feature-Bagging based LMNN

In PRID, a pedestrian's image is commonly divided into several patches to cope with pose variations and occlusions, and color features in different color spaces and various texture features are often extracted from each patch in order to obtain high discriminative ability. All these features of one image are then concatenated into a single feature vector, so the dimension of the feature vector is very high. Reducing the dimension of the feature vector is usually desired for metric learning (our sample-bagging method reduces the feature dimension by PCA), but it causes information loss. Another problem with using the original high-dimensional feature vector is that some elements that represent important subtle information may become too small after normalization and would not play a sufficient part in model learning. Such problems are alleviated in our feature-bagging method, since only a subset of the features (a relatively low-dimensional vector) is involved in learning each metric. In addition, it is suggested in [10] that attribute bagging can outperform sample bagging in ensemble learning, because randomly selecting feature subsets reduces the correlation between features.

Let $F \in \mathbb{R}^D$ be the D-dimensional feature vector. G sub-features $\{f_g\}_{g=1}^{G}$, $f_g \in \mathbb{R}^d$ with $d < D$, are formed by randomly selecting subsets of the dimensions of F, with a strategy similar to that of our sample-bagging method. Then, we learn metric $M_g$ by minimizing the LMNN cost function on the whole dataset, using the sub-feature vectors defined by $f_g$. The distance between $x_i$ and $x_j$ under metric $M_g$ is denoted as:

$$D_{M_g}(x_i, x_j) = (x_i^g - x_j^g)^T M_g (x_i^g - x_j^g), \tag{9}$$

where $x_i^g$ is the $\ell_1$-normalized sub-feature vector of $x_i$ defined by $f_g$. To aggregate all G metrics, the average of $D_{M_g}$, $g = 1, ..., G$, is taken as the final measure of distance between $x_i$ and $x_j$:

$$D_A(x_i, x_j) = \frac{1}{G} \sum_{g=1}^{G} D_{M_g}(x_i, x_j). \tag{10}$$

Fig. 3 gives a graphical illustration of the proposed feature-bagging based LMNN method.

Algorithm 2 Feature-bagging based LMNN
Input: $Q = \{(x_i, y_i)\}_{i=1}^{N}$: labeled training dataset.
Output: $\{L_g\}_{g=1}^{G}$: linear transformations.
1: for g = 1 : G do
2:   Randomly select d dimensions from original feature F to form feature subset $f_g$, where $d < D$;
3:   $Q_g = f_g(Q)$, where $f_g(\cdot)$ means feature selection;
4:   $Q_g = \|Q_g\|_1$;
5:   $L_g = \mathrm{LMNN}(Q_g)$;
6: end for
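The test-time side of the feature-bagging method, i.e. the averaged distance $D_A$ of Eq. (10), can be sketched as below. The metrics are taken as given (the LMNN training of step 5 is not reproduced), and the names `l1_normalize` and `feature_bagging_distance` as well as the `eps` guard against all-zero vectors are assumptions introduced for the example.

```python
import numpy as np

def l1_normalize(X, eps=1e-12):
    """Scale each row to unit L1 norm (step 4 of Algorithm 2)."""
    norms = np.abs(X).sum(axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

def feature_bagging_distance(A, B, subsets, metrics):
    """D_A of Eq. (10): mean over g of the distance on sub-feature f_g under M_g."""
    total = np.zeros((len(A), len(B)))
    for idx, M in zip(subsets, metrics):
        a = l1_normalize(A[:, idx])  # x_i^g: selected dimensions, then normalized
        b = l1_normalize(B[:, idx])
        diff = a[:, None, :] - b[None, :, :]
        total += np.einsum('ijk,kl,ijl->ij', diff, M, diff)  # Eq. (9)
    return total / len(subsets)
```

Note that normalization is applied after the dimensions are selected, so each sub-feature vector is rescaled on its own rather than inheriting the scale of the full D-dimensional vector.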

Fig. 3: Graphical illustration of the proposed feature-bagging based LMNN method.

3. EXPERIMENTS

3.1. Experiment setting

We evaluate our approach by comparing it with four baseline methods and a number of state-of-the-art methods on three publicly available datasets: VIPeR [11], PRID2011 [12] and Campus [13]. We follow the methodology used in [1] [5] for every dataset. The pedestrians in each dataset are randomly split into two halves, one half for training and the other for testing. In Campus, each person has two images in each view, which are also randomly selected. In the test phase, we randomly select one image of each pedestrian as the query image and use the others as the target images. The four baseline methods we compare with are CCA [14], nearest neighbor search using the Euclidean distance, and two public metric learning methods, ITML [15] and LMNN [9]. To reduce bias, we repeat the whole procedure 10 times and report the average of the results as the final performance.

3.2. Feature representation

Color features (RGB, HSV, Lab) and a texture feature (LBP) are used for feature representation. Firstly, the image is normalized to 128×48 pixels and divided into patches of 16×24 pixels, with an overlap of 8 pixels in the vertical direction and 12 pixels in the horizontal direction. A color histogram with 8 bins per channel and a uniform LBP histogram with 59 bins are then extracted from each rectangular patch. The feature histograms from all rectangular patches of an image are concatenated, forming a 5895-dimensional feature vector. Finally, for the sample-bagging method and the baseline methods, we reduce the feature dimension to 500 by PCA and normalize the resulting vectors.

Fig. 4: Matching rate at rank 50 vs. size of the feature subset on dataset VIPeR.
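The 5895-dimensional figure quoted in Section 3.2 can be checked with a few lines of arithmetic. The sketch assumes that the image is 128 pixels high and 48 pixels wide, that each patch is 16 pixels high and 24 pixels wide, and that the three color spaces contribute three channels each; the helper name `patches_along` is hypothetical.

```python
def patches_along(length, patch, overlap):
    """Number of patches of size `patch` that fit along `length` with the given overlap."""
    stride = patch - overlap
    return (length - patch) // stride + 1

rows = patches_along(128, 16, 8)    # vertical direction
cols = patches_along(48, 24, 12)    # horizontal direction
color_bins = 3 * 3 * 8              # RGB, HSV, Lab: 3 channels each, 8 bins per channel
lbp_bins = 59                       # uniform LBP histogram
feature_dim = rows * cols * (color_bins + lbp_bins)
print(rows, cols, feature_dim)      # 15 3 5895
```

Under these assumptions there are 15 × 3 = 45 patches with 131 bins each, which reproduces the stated dimension.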

3.3. Parameter selection

The appropriate selection of feature subsets plays an important role in the feature-bagging method. We study this issue experimentally in this section.

Size of the subset: Fig. 4 shows the matching rate at rank 50 with respect to the size of the feature subset on dataset VIPeR for feature-bagging based LMNN, where the number of predictors, G, is set to 10. As can be seen, the best performance is obtained when the subset size is d = 0.05 ∗ D (or 0.06 ∗ D). The feature subset should be neither too large nor too small. When the dimension of the sub-feature vector is too low, it may not contain enough features; when the dimension is too high, the elements of the vector that represent important subtle information may become too small after normalization.

The number of predictors: In Fig. 5, we compare the performance of our feature-bagging method under different numbers of predictors on dataset VIPeR. From the figure we observe that the matching rate increases with the number of predictors, because more and more features are taken into account. However, the improvement becomes small once the number of predictors, G, reaches a certain value, such as G = 35. The reason is that the feature subsets may overlap heavily with each other when the number of subsets is large.

Subset selection strategy: In Fig. 6, we compare the performance of our method under three different ways of selecting feature subsets on dataset VIPeR, with d = 0.05 ∗ D and G = 20. The three strategies are:

RR - randomly sample d adjacent dimensions of the original feature vector;

RNR - randomly sample d dimensions from the dimensions of the feature vector that remain after the previous subsets have been selected;

R - randomly sample d dimensions from the original feature vector.

As can be seen from Fig. 6, RNR performs better than the other two strategies.

Fig. 5: Matching rate at rank 50 vs. the number of predictors on dataset VIPeR.

Fig. 6: Matching rate at rank 50 vs. subset selection strategy on dataset VIPeR.
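The three selection strategies (R, RR and RNR) can be written down compactly. The sketch below is one possible reading of their definitions; the function name `select_feature_subsets`, the `seed` argument, and the requirement G · d ≤ D for RNR are assumptions, and d has to be passed as an integer (e.g. a rounded 0.05 ∗ D).

```python
import numpy as np

def select_feature_subsets(D, d, G, strategy="R", seed=0):
    """Return G index arrays of length d chosen from range(D)."""
    rng = np.random.default_rng(seed)
    if strategy == "R":
        # d random dimensions per subset; different subsets may overlap.
        return [np.sort(rng.choice(D, size=d, replace=False)) for _ in range(G)]
    if strategy == "RR":
        # d adjacent dimensions starting at a random position.
        starts = rng.integers(0, D - d + 1, size=G)
        return [np.arange(s, s + d) for s in starts]
    if strategy == "RNR":
        # Each subset is drawn from the dimensions not used by earlier subsets.
        if G * d > D:
            raise ValueError("RNR needs G * d <= D")
        perm = rng.permutation(D)
        return [np.sort(perm[g * d:(g + 1) * d]) for g in range(G)]
    raise ValueError("strategy must be 'R', 'RR' or 'RNR'")
```

With d = 0.05 ∗ D and G = 20, as in the comparison above, RNR uses every dimension exactly once, whereas R and RR generally leave some dimensions unused and reuse others.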

Although the above experiments were performed on VIPeR, the parameter selection in our feature-bagging method is not sensitive to the dataset. We use d = 0.05 ∗ D, G = 35 and strategy R in all the following experiments for the feature-bagging method on the three datasets.

3.4. Results and analysis

In the following experiments involving the sample-bagging method, we set the number of predictors to G = 10 and the size of the sample subset to n = 0.5 ∗ N. For both bagging-based LMNN methods, we set the step size of the LMNN optimization to 1e-6 and the maximum number of iterations to 1000.

We compare our approaches with the baseline methods CCA, nearest neighbor search using the Euclidean distance, ITML and LMNN. Fig. 7 shows the average CMC curves of those methods on dataset VIPeR, while Fig. 8 and Fig. 9 show the average CMC curves on datasets PRID2011 and Campus respectively. Our two approaches achieve significant improvements over the baseline methods. As can be seen, our feature-bagging method obtains the best performance, because no dimension reduction (PCA) is involved and useful subtle information can be kept in this method. In addition, the correlations between features are reduced by random feature selection in this method, which helps to improve performance in ensemble learning.

Fig. 7: Average CMC curves (matching rate (%) vs. rank score) of our approaches (F-Bagging, S-Bagging), CCA, NN, ITML and LMNN on the VIPeR dataset.

Fig. 8: Average CMC curves (matching rate (%) vs. rank score) of our approaches, CCA, NN, ITML and LMNN on the PRID2011 dataset.

Moreover, we compare our approaches with the state-of-the-art methods on VIPeR (see Table 1). Our sample-bagging approach obtains performance comparable to that of the other methods, and the feature-bagging approach achieves the best performance at rank 10 and rank 25. The overall performance of our feature-bagging approach is comparable to that of the best state-of-the-art method [7]. However, the method in [7] involves joint optimization of a gating function and local experts, which is much more computationally complex than our approach.

Finally, we use the "expected search time" defined in [1] to evaluate our approaches (see Table 2). This measure denotes the expected time needed to find the match for a query image when an average review time of 1 s per image is assumed. The expected search time of our feature-bagging approach is 10.2 s, an improvement of about 57% over the state-of-the-art method LMNN-R. The expected search time of our sample-bagging approach is 14.9 s, which is also an improvement of about 37% over LMNN-R.
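The quoted improvements follow directly from the expected search times reported for LMNN-R, S-bagging and F-bagging. The short check below uses only those three values; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline, ours):
    """Relative reduction of the expected search time with respect to a baseline."""
    return (baseline - ours) / baseline

lmnn_r, s_bagging, f_bagging = 23.7, 14.9, 10.2  # expected search times in seconds
print(round(100 * relative_improvement(lmnn_r, f_bagging)))  # 57
print(round(100 * relative_improvement(lmnn_r, s_bagging)))  # 37
```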

Fig. 9: Average CMC curves (matching rate (%) vs. rank score) of our approaches, CCA, NN, ITML and LMNN on the Campus dataset.

Table 1: Comparison of matching rates (%) at rank n on VIPeR (* indicates the best run).

Methods        Top 1   Top 10   Top 25   Top 50
F-bagging      28.4    73.54    90.32    96.74
S-bagging      20.2    61.1     82.2     93.9
LAFT [7]       29.6    69.3     88.7     96.8
KISSME [16]    19.6    62.2     80.7     91.8
SDALF [4]      19.9    49.4     70.5     84.8
PRDC [6]       15.7    53.9     76       87
LDML [16]      10.4    31.3     44.6     60.4
MCC [6]        15.2    57.6     80       91
MC [17]        12.7    56.1     77       88
LMNN-R* [5]    23.7    68       84       93

4. CONCLUSION

In this paper, we have applied bagging to LMNN to address the problem of large appearance variations in person re-identification. To the best of our knowledge, bagging has not been investigated in person re-identification before. Experimental results demonstrate that our sample-bagging based LMNN approach achieves performance comparable to that of the state-of-the-art methods on dataset VIPeR, and that our feature-bagging based LMNN approach improves the performance even further.

Table 2: Expected search times for our approaches and other methods.

Method                          Expected Search Time (s)
Chance [1]                      158.0
Template [1]                    109.0
Histogram [1]                   82.9
Hand Localized Histogram [1]    69.2
Principal Axis Histogram [1]    59.8
ELF [1]                         28.9
LMNN-R [5]                      23.7
S-bagging                       14.9
F-bagging                       10.2

5. REFERENCES

[1] D. Gray and H. Tao, “Viewpoint invariant pedestrian recognition with an ensemble of localized features,” in ECCV’08.

[2] Weiming Hu, Min Hu, Xue Zhou, Tieniu Tan, Jianguang Lou, and Steve Maybank, “Principal axis-based correspondence between multiple cameras for people tracking,” PAMI’06.

[3] Piotr Dollar, Zhuowen Tu, Hai Tao, and Serge Belongie, “Feature mining for image classification,” in CVPR’07.

[4] M. Farenzena, L. Bazzani, et al., “Person re-identification by symmetry-driven accumulation of local features,” in CVPR’10.

[5] M. Dikmen, E. Akbas, et al., “Pedestrian recognition with a learned metric,” in ACCV’10.

[6] W. S. Zheng, S. Gong, and T. Xiang, “Person re-identification by probabilistic relative distance comparison,” in CVPR’11.

[7] Wei Li and Xiaogang Wang, “Locally aligned feature transforms across views,” in CVPR’13.

[8] Leo Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2, pp. 123–140, 1996.

[9] John Blitzer, Kilian Q Weinberger, and Lawrence K Saul, “Distance metric learning for large margin nearest neighbor classification,” in NIPS’05.

[10] Robert Bryll, Ricardo Gutierrez-Osuna, and Francis Quek, “Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets,” Pattern recognition, vol. 36, no. 6, pp. 1291–1302, 2003.

[11] D. Gray, S. Brennan, and H. Tao, “Evaluating appearance models for recognition, reacquisition, and tracking,” in Workshop on PETS’07.

[12] Martin Hirzer, Csaba Beleznai, Peter M Roth, and Horst Bischof, “Person re-identification by descriptive and discriminative classification,” in Image Analysis, pp. 91–102. Springer, 2011.

[13] Wei Li, Rui Zhao, and Xiaogang Wang, “Human reidentification with transferred metric learning,” in ACCV’12.

[14] David R Hardoon, Sandor Szedmak, and John Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.

[15] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon, “Information-theoretic metric learning,” in Proceedings of the 24th international conference on Machine learning. ACM, 2007, pp. 209–216.

[16] M Kostinger, Martin Hirzer, Paul Wohlhart, Peter M Roth, and Horst Bischof, “Large scale metric learning from equivalence constraints,” in CVPR’12.

[17] Kai Liu, Xin Guo, Zhicheng Zhao, and Anni Cai, “Person re-identification using matrix complex,” in ICIP’13.
