BAGGING BASED METRIC LEARNING FOR PERSON RE-IDENTIFICATION

Bohuai Yao, Zhicheng Zhao, Kai Liu, Anni Cai

School of Information and Communication Engineering,
Beijing Key Laboratory of Network System and Network Culture,
Beijing University of Posts and Telecommunications, Beijing, China
{ybohuai, 1997.liukai}@gmail.com, {zhaozc, annicai}@bupt.edu.cn
ABSTRACT
Person re-identification is a challenging problem in computer vision due to large variations of appearance among different cameras. Recently, metric learning has been widely used to model the transformation between cameras. However, traditional metric learning based methods learn only one metric for the whole feature space, which cannot model different kinds of appearance variations well. In this paper, we introduce bagging into metric learning and propose a bagging-based large margin nearest neighbor (LMNN) method for person re-identification. That is, multiple LMNN predictors are generated on sub-regions of the feature space and leveraged to obtain an aggregated predictor for performance improvement. Two bagging strategies, sample-bagging and feature-bagging, are proposed and compared. Extensive experiments on three benchmarks demonstrate the superiority of the proposed approach over state-of-the-art methods.
Index Terms— Person re-identification, sample-bagging,
feature-bagging, LMNN
1. INTRODUCTION
With the rapid development of security systems and the progress of computer vision techniques, person re-identification (PRID), which aims to recognize an individual from images observed over a video surveillance network, has attracted increasing attention from researchers over the past years. Because it is easy to extract and relatively distinctive, clothing appearance is widely used in PRID for pedestrian representation. However, the large appearance variations caused by changes in pedestrian pose, camera viewpoint and lighting conditions between different cameras often make different persons appear even more similar than two views of the same person. As a result, PRID remains a challenging task.
This work is supported by the Chinese National Natural Science Foundation (90920001, 61101212, 61372169), the National High Technology R&D Program of China (863 Program) (No. 2012AA012505, 2012AA012504), the National Key Technology R&D Program (2012BAH63F00, 2012BAH41F03), and the Fundamental Research Funds for the Central Universities.
Existing works try to solve this problem in two ways: (1) seeking distinctive and stable feature representations of individuals' appearance, such as color histograms [1], principal axis histograms [2] and rectangular region histograms [3]; some approaches further find a global weighting that aggregates a number of features together to improve performance [4]; (2) learning a distance metric, or projecting features from different views into a common space for matching, in order to suppress inter-camera variations. For instance, LMNN-R [5] improved classification accuracy by finding large margin nearest neighbors with rejection, and PRDC [6] maximized the probability that a truly matched pair has a smaller distance than a mismatched pair.
In practice, however, the large variations in view angle, lighting, background clutter and occlusion encountered in PRID make the changes of appearance between different cameras complex. One distance metric usually cannot model all kinds of appearance variations over the whole image space. Therefore, learning a local metric for each configuration of pedestrian images becomes an effective way to attack this problem [7]. Based on this idea, we propose two methods in this paper that partition the whole dataset into a number of subsets and use each subset to train a local transformation matrix with LMNN. The distances under all the transformation matrices are then aggregated to form a final distance metric, which is used for matching. We introduce bagging [8] into our subset selection strategies (although in a form somewhat different from traditional bagging). To the best of our knowledge, bagging has not been applied to PRID before. The two proposed methods are named sample-bagging based LMNN and feature-bagging based LMNN, according to whether the data partition is performed in the sample space or in the feature space. Bagging improves the stability and accuracy of machine learning algorithms and also helps to avoid overfitting. With the benefits of bagging and local metric learning, our proposed methods achieve good performance. In particular, the feature-bagging method gives a further improvement, since the subsets produced by feature-bagging are usually less correlated than those produced by sample-bagging.
Fig. 1: (a) Different persons with similar appearance. (b) Color distortion makes different persons look similar.

In addition, the inter-class appearance difference in PRID can sometimes be quite small. For example, different pedestrians may wear the same or similar kinds of clothes (Fig. 1(a)), and color distortion caused by lighting or camera settings may make different pedestrians' clothes look the same (Fig. 1(b)). In such circumstances, keeping the subtle information in the features is critical for matching individual identities. It is well known that the computational expense of metric learning grows rapidly with the training-set size and the feature dimension. Thus dimension reduction is commonly employed when metric learning is adopted in PRID, and in consequence subtle information may be lost. In our bagging-based methods, however, the size of each subset is much smaller than that of the original data/feature set. Therefore, our sample-bagging based method can reduce the size of the problem for a large dataset, while the feature-bagging based method can do so for high-dimensional features with no information loss. In addition, by randomly and independently selecting small subsets, our methods can learn multiple metrics on the subsets in parallel.
Our bagging-based methods are described in Section 2 and experimentally evaluated in Section 3 on three benchmark datasets for person re-identification.
2. THE APPROACH
In this section, we first briefly describe the LMNN method and then give detailed explanations of our bagging based LMNN methods.
2.1. LMNN
LMNN [9] learns a Mahalanobis distance metric which enforces the k-nearest neighbors of each sample to belong to the same class, while samples from different classes are separated by a large margin.
Given a training set of $N$ samples and the corresponding class labels $\{(x_i, y_i)\}_{i=1}^{N}$, let $y_{ij} \in \{0, 1\}$ indicate whether $y_i$ and $y_j$ match, and $\eta_{ij} \in \{0, 1\}$ indicate whether $x_j$ is a target neighbor of $x_i$. The goal of LMNN is to learn a linear transformation $L: \mathbb{R}^d \rightarrow \mathbb{R}^d$ and to compute the squared distance as:

$$D_L(x_i, x_j) = \|L(x_i - x_j)\|^2. \quad (1)$$
The squared distance can be rewritten as:

$$D_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), \quad (2)$$

where the Mahalanobis distance metric $M$ is induced by the linear transformation $L$ as $M = L^T L$. Thus $D_M$ is a valid distance because $M$ is a symmetric positive-semidefinite matrix.
On one hand, we can minimize the distance between each training point and its $K$ nearest similarly labeled neighbors by minimizing $\varepsilon_{pull}$ as follows:

$$\varepsilon_{pull}(M) = \sum_{i,j=1}^{N} \eta_{ij} D_M(x_i, x_j). \quad (3)$$
On the other hand, we can maximize the distance between each training point and all differently labeled points that are closer than the aforementioned $K$ nearest neighbors' distances plus a constant margin, by minimizing $\varepsilon_{push}$:

$$\varepsilon_{push}(M) = \sum_{i,j=1}^{N} \sum_{l=1}^{N} \eta_{ij} (1 - y_{il}) \left[ 1 + D_M(x_i, x_j) - D_M(x_i, x_l) \right]_+. \quad (4)$$
The affine combination of $\varepsilon_{pull}$ and $\varepsilon_{push}$ defines the overall cost $\varepsilon_{LMNN}$:

$$\varepsilon_{LMNN}(M) = (1 - \mu)\,\varepsilon_{pull}(M) + \mu\,\varepsilon_{push}(M), \quad (5)$$

where $\mu$ is a tuning parameter and $[z]_+ = \max(z, 0)$ denotes the standard hinge loss. The cost function consists of two terms: the first penalizes large distances between each training point and its target neighbors, while the second penalizes small distances between each training point and the impostors (i.e., all differently labeled training points that are nearer than the target neighbors).
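To make Eqs. (2)-(5) concrete, the following is a minimal NumPy sketch that evaluates the LMNN cost for a fixed metric M; it only computes the objective (a solver would also optimize it), and names such as target_mask are illustrative, not from the paper.

    import numpy as np

    def lmnn_cost(M, X, y, target_mask, mu=0.5):
        # X: (N, d) features; y: (N,) labels; target_mask: (N, N) boolean,
        # True where x_j is a target neighbor of x_i (eta_ij in Eq. (3)).
        diffs = X[:, None, :] - X[None, :, :]                  # pairwise x_i - x_j
        D = np.einsum('ijd,de,ije->ij', diffs, M, diffs)       # D_M(x_i, x_j), Eq. (2)
        eps_pull = D[target_mask].sum()                        # Eq. (3)
        same = y[:, None] == y[None, :]                        # y_il indicator
        eps_push = 0.0
        for i, j in zip(*np.nonzero(target_mask)):
            hinge = 1.0 + D[i, j] - D[i, :]                    # margin term of Eq. (4)
            eps_push += hinge[(~same[i]) & (hinge > 0)].sum()  # impostors, [z]_+ applied
        return (1 - mu) * eps_pull + mu * eps_push             # Eq. (5)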
2.2. Sample-Bagging based LMNN
Traditional bagging was introduced by Breiman [8]. It is an approach that can yield substantial gains in accuracy and stability by generating multiple versions of a predictor and leveraging them to obtain an aggregated predictor.
In our proposed sample-bagging based LMNN method, suppose $Q = \{(x_i, y_i)\}_{i=1}^{N}$ is the training set, where $x_i$ is the feature vector of sample $i$ and $y_i$ is its class label. In order to model the various kinds of appearance variations between cameras, we randomly divide the whole dataset into several subsets $R_g$, $g = 1, ..., G$, and learn one particular LMNN model to cope with the variations in each subset. The sample selection strategy differs from that of traditional bagging in two respects. Firstly, the size of $R_g$ is $n$, where $n < N$, because we want to apply localized learning on small subsets to tackle the complex distributions over the whole dataset. Secondly, $R_g$ is extracted at random from $Q$ without replacement, so there are no repeated samples within one subset. All subsets are independently selected from the complete training dataset $Q$, so they may partially overlap with each other.

Algorithm 1 Sample-bagging based LMNN
Input: $Q = \{(x_i, y_i)\}_{i=1}^{N}$: labeled training dataset.
Output: $\{L_g\}_{g=1}^{G}$: linear transformations; $\{w_g\}_{g=1}^{G}$: weights.
1: Reduce the feature dimension by PCA;
2: Normalize the features: $Q = \|Q\|_1$;
3: for $g = 1 : G$ do
4:   Randomly select $n$ samples $(x_j, y_j)$ to form subset $R_g$, where $n < N$;
5:   $L_g = \mathrm{LMNN}(R_g)$, with $M_g = L_g^T L_g$;
6:   $T_g = Q - R_g$;
7:   $w_g = \mathrm{MR}_{\mathrm{rank}\text{-}m}(T_g, M_g)$;
8: end for
Because the feature dimension is very high, we reduce it with PCA and normalize the feature vectors before training the models. Owing to the independence of the subsets $R_g$, one metric $M_g$ can be learned by minimizing the LMNN cost function, Eq. (5), on subset $R_g$ alone, and the distance between two samples $x_i$ and $x_j$ under this metric is:

$$D_{M_g}(x_i, x_j) = (x_i - x_j)^T M_g (x_i - x_j). \quad (6)$$
To aggregate all $G$ metrics, we define the final distance between $x_i$ and $x_j$ as:

$$D_F(x_i, x_j) = \sum_{g=1}^{G} w_g D_{M_g}(x_i, x_j), \quad (7)$$

where $w_g$ is a weighting factor. Since the subsets $R_g$, $g = 1, ..., G$, are randomly extracted from $Q$, they may not be equally able to represent the characteristics of the original dataset. The weighting factor reflects the confidence in the distance $D_{M_g}$ learned from subset $R_g$. We perform matching with $D_{M_g}$ on a validation set that is the complement of $R_g$, $T_g = Q - R_g$. The weighting factor is then:

$$w_g = \mathrm{MR}_{\mathrm{rank}\text{-}m}(T_g, M_g), \quad (8)$$

where $\mathrm{MR}_{\mathrm{rank}\text{-}m}(\cdot)$ denotes the matching rate at rank $m$ on $T_g$.
It is worth noting that, because the training processes of the different metrics are independent, the proposed method can be implemented in parallel with high efficiency.
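As a minimal sketch of Algorithm 1 and the aggregation of Eqs. (7)-(8): learn_lmnn() and rank_m_matching_rate() are hypothetical stand-ins for an off-the-shelf LMNN solver (e.g., the metric-learn package) and a rank-m matching-rate evaluator, neither of which is specified here.

    import numpy as np

    def sample_bagging_lmnn(X, y, G=10, ratio=0.5, m=10, rng=None):
        # Train G local LMNN metrics on random sample subsets (Algorithm 1).
        rng = rng if rng is not None else np.random.default_rng(0)
        N = len(X)
        n = int(ratio * N)                                 # subset size n < N
        metrics, weights = [], []
        for g in range(G):
            idx = rng.choice(N, size=n, replace=False)     # R_g, no repeated samples
            M_g = learn_lmnn(X[idx], y[idx])               # hypothetical LMNN solver
            rest = np.setdiff1d(np.arange(N), idx)         # validation set T_g = Q - R_g
            w_g = rank_m_matching_rate(X[rest], y[rest], M_g, m)  # Eq. (8), hypothetical
            metrics.append(M_g)
            weights.append(w_g)
        return metrics, weights

    def aggregated_distance(xi, xj, metrics, weights):
        # Final weighted distance D_F of Eq. (7).
        d = xi - xj
        return sum(w * (d @ M @ d) for M, w in zip(metrics, weights))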
A graphical illustration of the proposed sample-bagging based LMNN method is given in Fig. 2, where red lines indicate the training processes, blue lines the testing processes, and green dotted lines the weight learning processes.
Fig. 2: Graphical illustration of the proposed sample-bagging
based LMNN method.
2.3. Feature-Bagging based LMNN
In PRID, a pedestrian image is commonly divided into several patches to cope with pose variations and occlusions, and color features in different color spaces as well as various texture features are extracted from each patch in order to obtain high discriminative ability. All the features of one image are then concatenated into a single feature vector, so the dimension of the feature vector is very high. Reducing this dimension is usually necessary for metric learning (our sample-bagging method reduces the feature dimension by PCA), which causes information loss. Another problem with using the original high-dimensional feature vector is that some elements representing important subtle information may become too small after normalization and thus play an insufficient part in model learning. Both problems are alleviated in our feature-bagging method, since only a subset of the features (a relatively low-dimensional vector) is involved in each metric learning run. In addition, it has been suggested [10] that attribute bagging can outperform sample bagging in ensemble learning, because randomly selecting feature subsets reduces the correlation between features.
Suppose $F$ is a $D$-dimensional feature vector, $F \in \mathbb{R}^D$. $G$ sub-feature sets $\{f_g\}_{g=1}^{G}$, $f_g \in \mathbb{R}^d$ with $d < D$, are formed by random selections from $F$ with a strategy similar to that of our sample-bagging method. We then learn each metric $M_g$ by minimizing the LMNN cost function on the whole dataset, using the sub-feature vectors defined by $f_g$. The distance between $x_i$ and $x_j$ under metric $M_g$ is:

$$D_{M_g}(x_i, x_j) = (x_i^g - x_j^g)^T M_g (x_i^g - x_j^g), \quad (9)$$
where $x_i^g$ is the $\ell_1$-normalized sub-feature vector of $x_i$ defined by $f_g$. To aggregate all $G$ metrics, the average of $D_{M_g}$, $g = 1, ..., G$, is taken as the final measure of distance between $x_i$ and $x_j$:

$$D_A(x_i, x_j) = \frac{1}{G} \sum_{g=1}^{G} D_{M_g}(x_i^g, x_j^g). \quad (10)$$
Fig. 3 gives a graphical illustration of the proposed feature-bagging based LMNN method.
Algorithm 2 Feature-bagging based LMNN
Input: $Q = \{(x_i, y_i)\}_{i=1}^{N}$: labeled training dataset.
Output: $\{L_g\}_{g=1}^{G}$: linear transformations.
1: for $g = 1 : G$ do
2:   Randomly select $d$ dimensions from the original feature $F$ to form feature subset $f_g$, where $d < D$;
3:   $Q_g = f_g(Q)$, where $f_g(\cdot)$ denotes feature selection;
4:   $Q_g = \|Q_g\|_1$;
5:   $L_g = \mathrm{LMNN}(Q_g)$;
6: end for
Fig. 3: Graphical illustration of the proposed feature-bagging
based LMNN method.
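Analogously, a minimal sketch of Algorithm 2 together with the averaged distance of Eq. (10), reusing the same hypothetical learn_lmnn() solver as in the sample-bagging sketch:

    import numpy as np

    def feature_bagging_lmnn(X, y, G=35, ratio=0.05, rng=None):
        # Train G LMNN metrics on random feature subsets (Algorithm 2).
        rng = rng if rng is not None else np.random.default_rng(0)
        D = X.shape[1]
        d = max(1, int(ratio * D))                        # d = 0.05 * D in the paper
        subsets, metrics = [], []
        for g in range(G):
            f_g = rng.choice(D, size=d, replace=False)    # strategy R: random dimensions
            X_g = X[:, f_g]
            X_g = X_g / np.abs(X_g).sum(axis=1, keepdims=True)  # l1-normalize sub-features
            metrics.append(learn_lmnn(X_g, y))            # hypothetical LMNN solver
            subsets.append(f_g)
        return subsets, metrics

    def averaged_distance(xi, xj, subsets, metrics):
        # Final averaged distance D_A of Eq. (10); as in the paper, the
        # sub-feature vectors are assumed l1-normalized before this call.
        total = 0.0
        for f_g, M_g in zip(subsets, metrics):
            diff = xi[f_g] - xj[f_g]
            total += diff @ M_g @ diff
        return total / len(metrics)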
3. EXPERIMENTS
3.1. Experiment setting
We evaluate our approach by comparing it with four baseline methods and a number of state-of-the-art methods on three publicly available datasets: VIPeR [11], PRID2011 [12] and Campus [13]. We follow the methodology used in [1][5] for every dataset: the pedestrians in each dataset are randomly split into two halves, one half for training and the other for testing. In Campus, each person has two images in each view, which are also randomly selected. In the test phase, we randomly select one image of each pedestrian as the query image and the others as the target images. The four baseline methods we compare with are CCA [14], nearest neighbor (NN) search using the Euclidean distance, and two public metric learning methods, ITML [15] and LMNN [9]. To reduce bias, we repeat the whole procedure 10 times and report the average of the results as the final performance.
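For reference, a minimal sketch of the CMC evaluation under this single-shot protocol; the inputs (a query-by-gallery distance matrix and identity labels) are illustrative assumptions:

    import numpy as np

    def cmc_curve(dist, query_ids, gallery_ids):
        # dist: (n_query, n_gallery) distances; returns the matching rate
        # at each rank, averaged over all queries (the CMC curve).
        n_query, n_gallery = dist.shape
        hits = np.zeros(n_gallery)
        for q in range(n_query):
            order = np.argsort(dist[q])                        # gallery sorted by distance
            rank = np.flatnonzero(gallery_ids[order] == query_ids[q])[0]
            hits[rank:] += 1                                   # a hit at rank r counts for all ranks >= r
        return hits / n_query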
3.2. Feature representation
Color features (RGB, HSV, Lab) and a texture feature (LBP) are used for feature representation. First, each image is normalized to 128*48 pixels and divided into patches of 16*24 pixels, with an 8-pixel overlap in the vertical direction and a 12-pixel overlap in the horizontal direction. A color histogram with 8 bins per channel and a uniform LBP histogram with 59 bins are then extracted from each rectangular patch. The feature histograms of all rectangular patches of an image are concatenated, forming a 5895-dimensional feature vector. Finally, in the sample-bagging method and the baseline methods, we reduce the feature dimension to 500 by PCA and normalize the feature vectors.

Fig. 4: Matching rate at rank 50 vs. size of the feature subset on the VIPeR dataset.
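For illustration, a minimal sketch of this descriptor using scikit-image for the color conversions and LBP; the per-channel histogram ranges are left to NumPy defaults here, which is an assumption rather than the original implementation:

    import numpy as np
    from skimage import color
    from skimage.feature import local_binary_pattern

    def extract_features(rgb):
        # rgb: (128, 48, 3) float image in [0, 1]; returns a 5895-d vector:
        # 45 patches * (3 spaces * 3 channels * 8 bins + 59 LBP bins) = 45 * 131.
        spaces = [rgb, color.rgb2hsv(rgb), color.rgb2lab(rgb)]
        gray = color.rgb2gray(rgb)
        # non-rotation-invariant uniform LBP with 8 neighbors -> 59 distinct labels
        lbp = local_binary_pattern(gray, P=8, R=1, method='nri_uniform')
        feats = []
        for top in range(0, 128 - 16 + 1, 8):        # 15 rows, 8-pixel vertical step
            for left in range(0, 48 - 24 + 1, 12):   # 3 columns, 12-pixel horizontal step
                for img in spaces:
                    patch = img[top:top + 16, left:left + 24]
                    for c in range(3):               # 8-bin histogram per color channel
                        h, _ = np.histogram(patch[..., c], bins=8)
                        feats.append(h)
                h, _ = np.histogram(lbp[top:top + 16, left:left + 24],
                                    bins=59, range=(0, 59))
                feats.append(h)
        return np.concatenate(feats).astype(float)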
3.3. Parameter selection
The appropriate selection of feature subsets plays an important role in the feature-bagging method. We study this issue experimentally in this section.

Size of the subset: Fig. 4 shows the matching rate at rank 50 with respect to the size of the feature subset on the VIPeR dataset for feature-bagging based LMNN, where the number of predictors G is set to 10. As can be seen, the best performance is obtained when the subset size is d = 0.05 * D (or 0.06 * D). The feature subset can be neither too large nor too small: when the dimension is too low, the subset may not contain enough features, while when it is too high, the elements that represent important subtle information may become too small after normalization.
The number of predictors: In Fig. 5 we compare the performance of our feature-bagging method under different numbers of predictors on the VIPeR dataset. We observe that the matching rate increases with the number of predictors, because more and more features are taken into account. However, the improvement becomes small after the number of predictors G reaches a certain value, such as G = 35, because the feature subsets largely overlap with each other when the number of subsets is large.
Subset selection strategy: In Fig. 6 we compare the performance of our method under three different ways of selecting feature subsets on the VIPeR dataset, with d = 0.05 * D and G = 20. The three strategies are:

RR: randomly sample d adjacent dimensions of the original feature vector;

RNR: randomly sample d dimensions from the dimensions that remain after the previous subsets have been selected;

R: randomly sample d dimensions from the original feature vector.
As can be seen from Fig. 6, RNR performs better than the other two strategies.

Fig. 5: Matching rate at rank 50 vs. the number of predictors on the VIPeR dataset.

Fig. 6: Matching rate at rank 50 vs. subset selection strategy on the VIPeR dataset.

Although the above experiments were performed on VIPeR, the parameter selections in our feature-bagging method are not sensitive to the particular dataset. We use d = 0.05 * D, G = 35 and strategy R in all the following experiments with the feature-bagging method on the three datasets.
3.4. Results and analysis
In the following experiments involving the sample-bagging method, we set the number of predictors to G = 10 and the size of the sample subset to n = 0.5 * N. For both bagging-based LMNN methods, we set the iteration step size to 1e-6 and the maximum number of iterations to 1000.
We compare our approaches with the baseline methods: CCA, nearest neighbor search using the Euclidean distance, ITML and LMNN. Fig. 7 shows the average CMC curves of these methods on the VIPeR dataset, while Fig. 8 and Fig. 9 show the average CMC curves on PRID2011 and Campus, respectively. Our two approaches achieve significant improvements over the baseline methods. The feature-bagging method obtains the best performance because no dimension reduction (PCA) is involved, so useful subtle information is kept. In addition, the random feature selection in this method reduces the correlations between features, which helps performance in ensemble learning.
Fig. 7: Average CMC curves of our approaches, CCA, NN,
ITML and LMNN on the VIPeR dataset.
Fig. 8: Average CMC curves of our approaches, CCA, NN,
ITML and LMNN on the PRID2011 dataset.
Moreover, we compare our approaches with state-of-the-art methods on VIPeR (see Table 1). Our sample-bagging approach obtains performance comparable with the other methods, and the feature-bagging approach achieves the best performance at rank 10 and rank 25. The overall performance of our feature-bagging approach is comparable to that of the best state-of-the-art method [7]. However, the method in [7] involves joint optimization of a gating function and local experts, which is much more computationally complex than our approach.
Finally, we use the "expected search time" defined in [1] to evaluate our approaches (see Table 2). This measure is the expected time needed to find the true match for a query when an average review time of 1 s per image is assumed. The expected search time of our feature-bagging approach is 10.2 s, an improvement of about 57% over the state-of-the-art method LMNN-R; that of our sample-bagging approach is 14.9 s, an improvement of about 37% over LMNN-R.
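Under this definition, the measure is simply the mean rank of the true match scaled by the assumed review time; a minimal sketch:

    import numpy as np

    def expected_search_time(true_ranks, seconds_per_image=1.0):
        # true_ranks: 1-based rank of the correct match for each query.
        return float(np.mean(true_ranks)) * seconds_per_image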
4. CONCLUSION

In this paper, we have applied bagging to LMNN to address the problem of large appearance variations in person re-identification. To the best of our knowledge, bagging has not been investigated for person re-identification before. Experimental results demonstrate that our sample-bagging based LMNN approach achieves performance comparable with the state-of-the-art methods on the VIPeR dataset, and our feature-bagging based LMNN approach improves the performance even further.
Fig. 9: Average CMC curves of our approaches, CCA, NN,
ITML and LMNN on the Campus dataset.
Table 1: Comparison of matching rates (%) at rank n on VIPeR (* indicates the best run).

Methods        Top 1   Top 10  Top 25  Top 50
F-bagging      28.4    73.54   90.32   96.74
S-bagging      20.2    61.1    82.2    93.9
LAFT [7]       29.6    69.3    88.7    96.8
KISSME [16]    19.6    62.2    80.7    91.8
SDALF [4]      19.9    49.4    70.5    84.8
PRDC [6]       15.7    53.9    76      87
LDML [16]      10.4    31.3    44.6    60.4
MCC [6]        15.2    57.6    80      91
MC [17]        12.7    56.1    77      88
LMNN-R* [5]    23.7    68      84      93
5. REFERENCES
[1] D. Gray and H. Tao, "Viewpoint invariant pedestrian recognition with an ensemble of localized features," in ECCV'08.

[2] Weiming Hu, Min Hu, Xue Zhou, Tieniu Tan, Jianguang Lou, and Steve Maybank, "Principal axis-based correspondence between multiple cameras for people tracking," PAMI'06.

[3] Piotr Dollar, Zhuowen Tu, Hai Tao, and Serge Belongie, "Feature mining for image classification," in CVPR'07.

[4] M. Farenzena, L. Bazzani, et al., "Person re-identification by symmetry-driven accumulation of local features," in CVPR'10.

[5] M. Dikmen, E. Akbas, et al., "Pedestrian recognition with a learned metric," in ACCV'10.

[6] Wei-Shi Zheng, Shaogang Gong, and Tao Xiang, "Person re-identification by probabilistic relative distance comparison," in CVPR'11.
Table 2: Expected search times for our approaches and other methods.
Method Expected Search Time (s)
Chance [1] 158.0
Template [1] 109.0
Histogram [1] 82.9
Hand Localized Histogram [1] 69.2
Principal Axis Histogram [1] 59.8
ELF [1] 28.9
LMNN-R [5] 23.7
S-bagging 14.9
F-bagging 10.2
[7] Wei Li and Xiaogang Wang, "Locally aligned feature transforms across views," in CVPR'13.

[8] Leo Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.

[9] John Blitzer, Kilian Q. Weinberger, and Lawrence K. Saul, "Distance metric learning for large margin nearest neighbor classification," in NIPS'05.

[10] Robert Bryll, Ricardo Gutierrez-Osuna, and Francis Quek, "Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets," Pattern Recognition, vol. 36, no. 6, pp. 1291-1302, 2003.

[11] D. Gray, S. Brennan, and H. Tao, "Evaluating appearance models for recognition, reacquisition, and tracking," in Workshop on PETS'07.

[12] Martin Hirzer, Csaba Beleznai, Peter M. Roth, and Horst Bischof, "Person re-identification by descriptive and discriminative classification," in Image Analysis, pp. 91-102, Springer, 2011.

[13] Wei Li, Rui Zhao, and Xiaogang Wang, "Human reidentification with transferred metric learning," in ACCV'12.

[14] David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Computation, vol. 16, no. 12, pp. 2639-2664, 2004.

[15] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon, "Information-theoretic metric learning," in Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 209-216.

[16] Martin Köstinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, and Horst Bischof, "Large scale metric learning from equivalence constraints," in CVPR'12.

[17] Kai Liu, Xin Guo, Zhicheng Zhao, and Anni Cai, "Person re-identification using matrix complex," in ICIP'13.