Pattern Recognition 67 (2017) 353–365
Contents lists available at ScienceDirect
Pattern Recognition
journal homepage: www.elsevier.com/locate/patcog
Discriminative pose-free descriptors for face and object matching
Soubhik Sanyal, Sivaram Prasad Mudunuri, Soma Biswas ∗
Department of Electrical Engineering, Indian Institute of Science, Bangalore, India
Article info
Article history:
Received 25 July 2016
Revised 25 December 2016
Accepted 10 February 2017
Available online 17 February 2017
Keywords:
Face recognition
Object recognition
Pose invariant matching
Metric learning
Canonical correlation
Subspace to point representation.
Abstract
Pose-invariant matching is an important problem with applications such as recognizing faces in uncontrolled scenarios, where facial images appear in a wide variety of pose and illumination conditions and often at low resolution. We propose two discriminative pose-free descriptors, the Subspace Point Representation (DPF-SPR) descriptor and the Layered Canonical Correlated (DPF-LCC) descriptor, for matching faces and objects across pose. Training examples at very few poses are used to generate virtual intermediate pose subspaces. An image is represented by a feature set obtained by projecting its low-level feature onto these subspaces, and a discriminative transform is applied to make this feature set suitable for recognition. We represent this discriminative feature set by two novel descriptors. In the first approach, we transform it to a vector using the subspace-to-point representation technique. In the second approach, a layered structure of canonical correlated subspaces is formed, onto which the feature set is projected. Experiments on recognizing faces and objects across pose, and comparisons with the state of the art, show the effectiveness of the proposed approach.
© 2017 Elsevier Ltd. All rights reserved.
∗ Corresponding author.
E-mail address: [email protected] (S. Biswas).
http://dx.doi.org/10.1016/j.patcog.2017.02.016
0031-3203/© 2017 Elsevier Ltd. All rights reserved.

1. Introduction
Matching faces (or objects) across a wide variety of poses is a very important area of research in computer vision with many applications. For example, in a surveillance setting, the face of a person captured by overhead cameras may be in an arbitrary pose and at poor resolution, as opposed to the frontal, high-resolution image that is typically captured during enrolment (Fig. 1, columns 1 and 2). For object matching, the images captured during testing can be taken from a different viewpoint than the images stored in the database, which again requires comparing objects in different poses (Fig. 1, columns 3–6). These tasks are challenging because the appearance of the images to be matched can be very different due to significant pose variations.
In this paper, we propose two discriminative pose-free descriptors, the Subspace Point Representation (DPF-SPR) descriptor (also termed DPFD in [1]) and the Layered Canonical Correlated (DPF-LCC) descriptor, for matching faces and objects across different poses. During the training phase, images from a few poses (two to three) are used to generate virtual subspaces for the intermediate poses. We generate the virtual intermediate subspaces by treating the subspaces generated by the training data as points on the Grassmann manifold and sampling the shortest geodesic path between those points. Then, we represent an image (or image region, depending on the application) by a set of features, computed by projecting its low-level feature vector onto all the intermediate subspaces, which helps ensure that at least one of the features from the entire feature set will match when images with different poses are compared. Since our final goal is recognition, we use a discriminative transform learned using the class labels of the training data to transform the feature set. The DPF-SPR or DPF-LCC descriptor is then computed from the feature set and can be used directly for matching. In this paper, our focus is on the following challenging tasks:
1. Unconstrained face recognition, where the gallery consists of
frontal images captured during enrolment and the probe im-
ages can be in any arbitrary pose. We also address the problem
where, in addition to non-frontal pose, the probe images also
have low-resolution as is usually the case in surveillance set-
ting when the images are taken from a large distance from the
subject. We perform extensive experiments on the CMU PIE [2] ,
Multi-PIE [3] and the SCface database [4] .
2. Object recognition across pose, where the objects of different
poses are to be matched. For this purpose we evaluate the pro-
posed approach on COIL-20 [5] and RGB-D object datasets [6] .
We also consider the task of matching depth images of ob-
jects across pose to test the generalizability of the proposed
approach.
We compare the proposed approach with state-of-the-art metric learning, cross-pose, domain adaptation and coupled dictionary learning approaches to show the effectiveness of the two descriptors. The novelty/contribution of the work is as follows:
Fig. 1. Applications requiring matching across pose variations. Column 1,2: Face
recognition in uncontrolled setting; and Column 3–6: Object recognition across
viewpoint.
• Two novel discriminative pose-free descriptors, the Subspace Point Representation (DPF-SPR) and the Layered Canonical Correlated (DPF-LCC) descriptor, are proposed for matching faces and objects across different views.
• The approach does not require separate training for different probe poses/viewpoints. This is an advantage over many other approaches, which work well when separate training is performed for the different poses encountered during testing.
• Very few poses (as few as two or three) are required during the training phase, and the method can generalize well to unseen poses.
• Extensive experiments illustrate the applicability of the proposed approach in diverse domains like face and object recognition.
A preliminary version of this work appeared in [1] . The rest of
the paper is organized as follows. Section 2 describes the related
literature. Details of the proposed approach are given in Section 3 .
The experimental results and analysis of the descriptors are re-
ported in Section 4 . The paper concludes with a brief discussion
section.
2. Related work
In this section, we provide pointers to some of the related work
in the area of recognizing faces and objects across pose. Recog-
nizing faces across pose is an important research area. Li et al.
[7] propose maximal likelihood correspondence estimation after
encoding face specific structure information of semantic corre-
spondence. Ding et al. [8] learn a transformation dictionary which
transforms the features of different poses into a discriminative
subspace where face matching is performed at patch level rather
than at holistic level. Ding et al. [9] extract multi-directional multi-
level dual cross patterns as pose invariant face descriptor. Zhang
et al. [10] propose a mixed norm approach which is achieved by
a trade-off between sparse representation classification and joint
sparse representation classification. Castillo and Jacobs [11] pro-
pose a method to compute stereo matching cost between two fa-
cial images by using epipolar geometry. Wang et al. [12] formulate
a general representation of kernel collaborative framework and de-
velop an l 2 regularized algorithm within it. Cament et al. [13] ex-
tract Gabor features from modified grids using mesh to model face
deformations produced by varying pose. Li et al. [14] train a Gaus-
sian mixture model to capture the spatial appearance distribution
of all face images in the training corpus. Yin et al. [15] design a
model for recognizing faces under large variations which can pre-
dict the appearance and likelihood of the given query face against
the collected generic identities. Arandjelovic et al. [16] propose a
framework that can improve the performance of any baseline face
retrieval algorithm by leveraging the structure of the database.
Recently, matching of low resolution facial images has gained
considerable attention [17,18] . Bhatt et al. [19] propose using a
combination of transfer learning and co-training paradigms for
cross resolution matching. Ren et al. [20] propose a method of extracting and representing discriminant features from faces and then alternately optimizing across different data domains. Al-Maadeed et al. [21] learn a pairwise dictionary and utilize a random pooling strategy to select a subset of visual words. Zhao et al. [22] combine forward and backward sparse representation for robust face recognition. A piecewise linear regression model is developed in [23] to learn the relationship between the high resolution (HR) image space and the low resolution (LR) image space for face super resolution. Metric learning approaches have shown a lot of promise for matching faces in unconstrained environments. Kostinger et al. [24] propose a method that learns a distance metric from the covariance matrices of similar and dissimilar pairs. Moutafis and Kakadiaris [25] propose an algorithm that can match HR and LR facial images by learning individual bases for optimal representation and coupled distance metrics to enhance the classification. Domain adaptation techniques have also been successfully used for matching face images across pose, illumination, blur, etc. [26]. In [27], dictionary learning is used to interpolate subspaces to link the source and target domains. The main drawback of most of these approaches is that they perform well only if the test faces have the same pose as the faces used for training. This is the key motivation for developing our algorithm, so that training can be done using a few representative poses and the test image can have a pose different from those used for training.
There has been a lot of progress in deep learning for face and object recognition tasks [28] in recent times. Taigman et al. [29] apply a piecewise affine transformation for 3D modeling of faces using deep convolutional networks. Schroff et al. [30] propose to learn a deep convolutional network by mapping from face images to a Euclidean space. Chan et al. [31] build a deep learning network with the help of cascaded principal component analysis, binary hashing and block-wise histograms. Though deep learning methods have shown better performance under a wide range of pose variations, the performance degrades when the resolution is low [32,33]. Handling the low resolution problem with deep learning approaches requires a sufficient number of training samples captured under poor resolution conditions for learning the model.
There has also been a lot of research in the area of general object recognition across different viewpoints [34]. Hsiao and Hebert [35] model occlusions by reasoning about 3D interactions of objects. Rubio et al. [36] use generative non-negative matrix factorization to find relevant parts for training instances. Wu et al. [37] propose a query-expanded collaborative representation based classifier with class-specific prototypes. A model that separates a view-invariant category representation from a category-invariant pose representation is proposed in [38]. He et al. [39] use a spatial pyramid pooling strategy to eliminate the need for a fixed-size input image for a convolutional neural network for object recognition.
One of the proposed descriptors is inspired by [40], in which images are matched across varying scales. Features at different scales can be computed from the same image itself, unlike features at different poses, which are the focus of our work. Generating intermediate subspaces by sampling the Grassmann manifold has also been exploited by [26], where the projections on these subspaces are used to train discriminative classifiers for each object. Instead, using the intermediate subspaces, we form a discriminative feature vector which can directly be used for matching. Our approach is more suitable for applications like face recognition, where there may not be any overlap between the training and testing subjects.
3. Proposed approach
In this section, we describe in detail the computation of the two discriminative pose-free descriptors, namely, DPF-SPR and DPF-LCC. The different steps required during training are:
Fig. 2. Flow chart of the proposed framework showing the training stage and construction of the two descriptors: DPF-SPR and DPF-LCC .
1. First, the virtual intermediate subspaces are generated from the
given training examples of a few pose regions. The feature vec-
tor from the input image is then projected onto all the sub-
spaces to form a feature set.
2. Then, from the training class labels, a discriminative transform
is learned. This completes the training stage for DPF-SPR .
3. For DPF-LCC computation, after learning the transformation ma-
trices, a layered model of correlated subspaces and the corre-
sponding projection matrices are learned.
During testing, after computing the feature set for a given image by projecting onto all the subspaces, the features are transformed using the discriminative transform learned in the training phase. Then, the subspace-to-point representation technique can be used to construct the DPF-SPR descriptor for the image. For computing DPF-LCC, the discriminative feature set is projected onto the canonical correlated subspaces learned during training to compute the descriptor. The flowcharts of the proposed approaches are provided in Fig. 2. Top left: portion of the training stage common to both descriptors; top right: training stage of DPF-LCC, continued from the left side; bottom: testing stage showing the computation of DPF-SPR and DPF-LCC. We describe each of the steps in detail in the following subsections.
3.1. Feature representation using intermediate subspaces
Suppose we have training images from some parts of the pose space, say, $P_1$, $P_2$ to $P_K$ ($K$ is as small as two or three) (Fig. 3). Our aim is to generate a descriptor for an image with any unknown pose so that it can be used for matching across poses. Assume that $f$ is the low-level feature descriptor of an image. We propose to represent the image using a collection of features $\{f_1, f_2, \ldots\}$ instead of a single $f$, because our goal is to match it in any pose. The collection $\{f_1, f_2, \ldots\}$ consists of the feature vectors that would be computed if we had that image at different poses. The chance of matching two images of the same object that differ only in pose is higher if we compare the feature sets from the two images rather than only $f$.
We compute virtual poses by learning the path between $P_k$ and $P_{k+1}$ in order to generate the features at different poses. For this purpose, we exploit the idea of sampling on the Grassmann manifold [26,41]. Suppose $N_{trn}$ is the number of training images in pose $P_k$ as well as in pose $P_{k+1}$. We denote the low-level features corresponding to the images in $P_k$ as $f_{k,i} \in R^D$, where $i = 1, 2, \ldots, N_{trn}$. Thus, we have a data matrix of dimension $D \times N_{trn}$ for pose $P_k$ as well as for $P_{k+1}$. We obtain the generative subspaces $P_k$ and $P_{k+1} \in R^{D \times d}$ by applying principal component analysis (PCA) to the data matrices of poses $P_k$ and $P_{k+1}$, respectively. The space of $d$-dimensional subspaces in $R^D$ can be identified with the Grassmann manifold $G_{d,D}$, and thus $P_k$ and $P_{k+1}$ are points on $G_{d,D}$. We aim to generate the virtual features corresponding to the intermediate subspaces between $P_k$ and $P_{k+1}$.
Let $P_k$ have an orthogonal complement $R_k \in R^{D \times (D-d)}$ such that $R_k^T P_k = 0$. Then the geodesic flow $\Psi(t)$, $t \in [0, 1]$, between $P_k$ and $P_{k+1}$ is such that $\Psi(t) \in G_{d,D}$, $\Psi(0) = P_k$ and $\Psi(1) = P_{k+1}$. This implies that, starting from $P_k$, the geodesic flow reaches $P_{k+1}$ in unit time. The expression for the flow at any time $t$ is given by
$$\Psi(t) = P_k U_1 \Gamma(t) - R_k U_2 \Sigma(t) \tag{1}$$
where $U_1 \in R^{d \times d}$ and $U_2 \in R^{(D-d) \times d}$ are orthonormal matrices. $U_1$ and $U_2$ can be obtained using the following equations
$$P_k' P_{k+1} = U_1 \Gamma V' \tag{2}$$
$$R_k' P_{k+1} = -U_2 \Sigma V' \tag{3}$$
where $\Gamma, \Sigma \in R^{d \times d}$ are diagonal matrices whose diagonal elements are $\cos\theta_i$ and $\sin\theta_i$ for $i = 1, 2, \ldots, d$. $\{\}'$ denotes the transpose operator. $\theta_i$ are known as the principal angles between $P_k$ and $P_{k+1}$. By using different values of $t$, we can obtain different intermediate subspaces.
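The sampling of Eqs. (1)–(3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes the usual geodesic-flow convention that Γ(t) and Σ(t) carry cos(tθ_i) and sin(tθ_i) on their diagonals, which the text above does not state explicitly. It also avoids forming R_k by using the identity R_k R_k' = I − P_k P_k'. The function names `pca_basis` and `geodesic_subspaces` are hypothetical.

```python
import numpy as np

def pca_basis(X, d):
    """Orthonormal basis (D x d) of the top-d principal directions of a D x N data matrix."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :d]

def geodesic_subspaces(Pk, Pk1, ts, eps=1e-8):
    """Sample Psi(t) between two D x d orthonormal bases for each t in ts."""
    # Eq. (2): Pk' Pk1 = U1 Gamma V', so the singular values are cos(theta_i).
    U1, cos_theta, Vt = np.linalg.svd(Pk.T @ Pk1)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    sin_theta = np.sin(theta)
    # Eq. (3) gives -Rk U2 Sigma = Rk Rk' Pk1 V = (I - Pk Pk') Pk1 V,
    # so the complement Rk never has to be formed explicitly.
    resid = (Pk1 - Pk @ (Pk.T @ Pk1)) @ Vt.T
    subspaces = []
    for t in ts:
        # Assumed time dependence: cos(t*theta_i) and sin(t*theta_i) on the diagonals.
        scale = np.where(sin_theta > eps,
                         np.sin(t * theta) / np.maximum(sin_theta, eps), 0.0)
        subspaces.append((Pk @ U1) * np.cos(t * theta) + resid * scale)
    return subspaces
```

Under these assumptions, t = 0 and t = 1 return bases that span the same subspaces as P_k and P_{k+1} (up to a rotation within the subspace), and every intermediate Ψ(t) has orthonormal columns.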
This work is motivated by the claim that if we project an image of any unknown pose onto an interpolated subspace, the reconstructed image will have a pose similar to the interpolated pose. To show this, we projected images of frontal pose and 30° pose (Fig. 4(a)) onto an interpolated pose. We observe that the reconstructed images (Fig. 4(b)) are close to the actual 15° pose (Fig. 4(c)) in both cases, thus justifying the subspace interpolation.

Fig. 3. Top: Training images from 3 different parts of the pose space (left pose, frontal, right pose) denoted by $P_1$, $P_2$ and $P_3$ respectively. Bottom: Virtual subspaces generated from training data, the two rows indicating the second and fourth eigenvectors of the subspaces.
Fig. 4. Illustration of pose reconstruction on the geodesic flow curve for two subjects. a: unknown pose; b: synthesized from interpolated subspace; c: actual image at interpolated pose.

The enhanced feature set $\{f_1, f_2, \ldots, f_H\}$ for each image is obtained by projecting the feature vector $f$ onto all the intermediate subspaces after generating them. Here $H$ is the total number of subspaces, of which $Z$ subspaces are computed from the actual training data and $(H - Z)$ are the intermediate virtual subspaces. Fig. 3 (bottom) shows virtual subspaces generated from training data from three parts of the pose space (Fig. 3 (top)), the two rows indicating the second and fourth eigenvectors of the subspaces.
Fig. 5. Proposed feature set representation generated using projections on all intermediate virtual subspaces vs. the SIFT descriptor. Each point in the curves indicates the difference of the feature vector for that pose from that computed from the frontal image.

We illustrate the effectiveness of the proposed feature set representation over the standard feature vector representation for matching across poses by performing an experiment on the Multi-PIE dataset [3]. We use images of 100 subjects under frontal illumination and five different poses, including the frontal pose (Fig. 5). In this experiment, we represent the corner of the right eye using the SIFT descriptor [42] and also using the proposed feature set representation (generated from the original SIFT descriptor). Each point in Fig. 5 shows the difference between the descriptor at a given pose and that of the frontal image, averaged over all 100 subjects. For our descriptor, the difference between two feature sets is computed as the minimum of the distances between all pairs of features. We observe that the difference increases with the pose difference. For the proposed feature set, this difference is much smaller than for the baseline SIFT descriptor, which indicates that the proposed feature set is more robust to change in pose. But we need to address two issues before using such a feature set for matching across pose variations:
1. The feature sets may not be discriminative enough for recognition or classification tasks because they are computed from generative subspaces.
2. Matching two feature sets using a measure like minimum distance is computationally expensive, since it requires H × H comparisons to compute the distance between two feature sets.
Both issues are addressed in the following subsections.
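To make the cost in item 2 concrete, the feature set and the minimum-distance comparison can be sketched as follows. This is an illustrative sketch only. It assumes that "projecting onto a subspace" means taking the coefficients P'f for an orthonormal D × d basis P, and that the pairwise distance is Euclidean; neither choice is spelled out in the text. The names `feature_set` and `min_set_distance` are hypothetical.

```python
import numpy as np

def feature_set(f, subspaces):
    """Project a D-dim feature onto each D x d basis; returns an H x d array."""
    return np.stack([P.T @ f for P in subspaces])

def min_set_distance(A, B):
    """Smallest Euclidean distance over all pairs of rows of A and B."""
    # Broadcasting builds every pairwise difference, i.e. H x H comparisons
    # when both sets contain H features.
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min()
```

The intermediate `diff` array grows with the product of the two set sizes, which is the quadratic cost that the compact descriptors of the following subsections are designed to avoid.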
3.2. Discriminative features
Here we describe the computation of a discriminative feature set for a given input image from the feature set computed from the generative subspaces. This is done to make the final descriptors discriminative for applications like face and object recognition. For this purpose, we utilize the class labels of the training data to learn a transformation. The feature sets of images are then transformed in such a way that those from the same class come closer to one another, and those belonging to different classes are moved further apart. In this work, the framework of Mahalanobis distance metric learning is used for making the features discriminative. In general, the squared distance between two features $x_i$, $x_j$ can be defined as
$$d^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j) \tag{4}$$
where $M \succeq 0$ is the positive semi-definite matrix that we want to learn. Learning one metric may not be sufficient for our approach because the difference in pose between the images to be matched may be significant (consider the two extremes of the pose space in Fig. 3). Therefore, we divide the whole pose space into, say, $T$ regions and propose to learn a metric for each of these regions. For example, if there are 12 subspaces in the entire pose space and we divide the space into 4 regions, then each region will consist of 3 subspaces. We jointly use the feature vectors of the constituent subspaces of each region as input features for the Mahalanobis metric learning. Here, we utilize a formulation similar to large scale metric learning (LSML) [24] for learning the metrics for each of these $T$ regions.
The approach considers two independent generation processes for match and non-match pairs. For example, for the face recognition application, we consider features from the same subject with variations in pose, or (jointly) in pose and resolution, as the match pairs and those from different subjects as the non-match pairs. We decide whether a given pair of features $x_i$ and $x_j$ of a region belongs to the same class using a likelihood ratio test, formulated as
$$\delta(x_i, x_j) = \log\left(\frac{p(x_i, x_j \mid H_0)}{p(x_i, x_j \mid H_1)}\right) \tag{5}$$
where $H_0$ and $H_1$ are the hypotheses that a pair is a non-match and a match, respectively. The value of $\delta(x_i, x_j)$ is small when a pair of features belongs to the same class and large when the pair belongs to different classes. Assuming a Gaussian structure for the difference space of features, the probabilities in (5) can be written as
$$p(x_i, x_j \mid H_0) = \frac{1}{\sqrt{2\pi |\Sigma_{n_{ij}=0}|}} \exp\left(-\frac{1}{2} x_{ij}^T \Sigma_{n_{ij}=0}^{-1} x_{ij}\right) \tag{6}$$
$$p(x_i, x_j \mid H_1) = \frac{1}{\sqrt{2\pi |\Sigma_{n_{ij}=1}|}} \exp\left(-\frac{1}{2} x_{ij}^T \Sigma_{n_{ij}=1}^{-1} x_{ij}\right) \tag{7}$$
where $x_{ij} = x_i - x_j$ is a vector in the difference space; $n_{ij} = 1$ for a match pair and $n_{ij} = 0$ for a non-match pair. $\Sigma_{n_{ij}=1}$ and $\Sigma_{n_{ij}=0}$ are the corresponding covariance matrices.
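Now we can reformulate (5) with the help of (6) and (7) as
$$\delta(x_{ij}) = \log\left(\frac{\dfrac{1}{\sqrt{2\pi |\Sigma_{n_{ij}=0}|}} \exp\left(-\dfrac{1}{2} x_{ij}^T \Sigma_{n_{ij}=0}^{-1} x_{ij}\right)}{\dfrac{1}{\sqrt{2\pi |\Sigma_{n_{ij}=1}|}} \exp\left(-\dfrac{1}{2} x_{ij}^T \Sigma_{n_{ij}=1}^{-1} x_{ij}\right)}\right) \tag{8}$$
Dropping the constant offset and the factor of 1/2, which do not affect the comparison of pairs, it can be further simplified as
$$\delta(x_{ij}) = x_{ij}^T \left(\Sigma_{n_{ij}=1}^{-1} - \Sigma_{n_{ij}=0}^{-1}\right) x_{ij} \tag{9}$$
Analyzing (4) and (9), the Mahalanobis metric is given by $M = \left(\Sigma_{n_{ij}=1}^{-1} - \Sigma_{n_{ij}=0}^{-1}\right)$.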
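A minimal sketch of how the metric in (9) could be estimated for one pose region is given below. It is not the authors' code. It assumes zero-mean pairwise differences, adds a small ridge term `reg` so the covariance matrices are invertible, and clips negative eigenvalues so that the result satisfies M ⪰ 0 as required by (4). The clipping step and the name `learn_region_metric` are assumptions.

```python
import numpy as np

def learn_region_metric(X, labels, reg=1e-6):
    """X: N x D features of one pose region, labels: length-N class ids."""
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)        # all unordered pairs
    diffs = X[i] - X[j]                             # x_ij = x_i - x_j
    same = labels[i] == labels[j]                   # n_ij = 1 for match pairs
    eye = np.eye(X.shape[1])
    cov_match = diffs[same].T @ diffs[same] / same.sum() + reg * eye
    cov_non = diffs[~same].T @ diffs[~same] / (~same).sum() + reg * eye
    M = np.linalg.inv(cov_match) - np.linalg.inv(cov_non)   # Eq. (9)
    # Assumed extra step: project onto the PSD cone by clipping eigenvalues.
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T
```

With T regions, this routine would be called once per region on the features of its constituent subspaces, giving T separate metrics.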
3.3. DPF-SPR computation
After computing the discriminative features, the distance between the feature sets of two images can be computed directly using a suitable set comparison metric. But this approach is computationally inefficient, which has motivated the development of DPF-SPR. The computational time is discussed in Section 4.5. We now explain the computation of the DPF-SPR descriptor from the discriminative feature set for efficient matching of two feature sets.
The set of descriptors corresponding to each region in pose space can be approximated to lie on a linear subspace. This is because the feature vectors change gradually across the virtual intermediate poses that have been generated. Suppose the basis vectors for region $t$ spanning the space of the features are given by $g_{t,1}, g_{t,2}, \ldots, g_{t,N_s} \in R^D$. The $D \times N_s$ matrix $G_t = [g_{t,1}, g_{t,2}, \ldots, g_{t,N_s}]$ represents the subspace for region $t$, where $N_s$ is the number of subspaces in a particular region, given by $N_s = H/T$. The dimension of each feature vector is $D$, which can be different for different applications. Now the subspace-to-vector representation for each region is computed. We can compute the vector representation by rearranging the elements of the $D \times D$ matrix $L = G_t G_t^T$ using the following operator (considering only the elements of the upper triangular matrix, with the diagonal elements scaled by $1/\sqrt{2}$) [40]
$$\text{DPF-SPR}_t = \left(\frac{l_{11}}{\sqrt{2}}, l_{12}, \ldots, l_{1D}, \frac{l_{22}}{\sqrt{2}}, l_{23}, \ldots, \frac{l_{DD}}{\sqrt{2}}\right)' \tag{10}$$
Finally, we concatenate the vector representations of all the $T$ regions into a single vector denoted by DPF-SPR, which is given as follows
$$\text{DPF-SPR} = [\text{DPF-SPR}_1; \text{DPF-SPR}_2; \ldots; \text{DPF-SPR}_T] \tag{11}$$
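Equations (10) and (11) translate directly into array operations; a hedged sketch follows. It assumes that the basis G_t of each region is obtained by orthonormalizing the region's N_s discriminative features with a QR decomposition, which is one plausible reading of "basis vectors" and not necessarily the authors' choice. The names `subspace_to_point` and `dpf_spr` are hypothetical.

```python
import numpy as np

def subspace_to_point(G):
    """Eq. (10): upper triangle of L = G G', row by row, diagonal scaled by 1/sqrt(2)."""
    L = G @ G.T
    rows, cols = np.triu_indices(L.shape[0])   # order: l11, l12, ..., l1D, l22, ...
    v = L[rows, cols]
    v[rows == cols] /= np.sqrt(2)
    return v

def dpf_spr(features, T):
    """features: H x D array (one discriminative feature per subspace); T regions."""
    regions = np.array_split(features, T)      # N_s = H / T features per region
    points = []
    for region in regions:
        # Assumption: orthonormalize the region's features to get the basis G_t.
        G, _ = np.linalg.qr(region.T)          # D x N_s
        points.append(subspace_to_point(G))
    return np.concatenate(points)              # Eq. (11)
```

With this scaling, the squared Euclidean distance between two points equals half the squared Frobenius distance between the corresponding matrices L, so comparing two descriptors needs a single vector distance instead of H × H feature comparisons. The descriptor length is T·D(D+1)/2.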
3.4. DPF-LCC computation
As we will see in the experimental section, DPF-SPR performs very well for matching faces and objects across pose. Another advantage is that it can generalize to unseen poses, i.e. poses which are not available during training. But one limitation is that the dimension of the descriptor can become considerably large as the number of intermediate subspaces increases. This increases the computational time during testing (discussed in Section 4.5). For the same reason, it is difficult to use this descriptor with low-level features which are themselves high dimensional (e.g. AlexNet features, which are of dimension 4096), as discussed in Section 4.5. To overcome this limitation, we propose another novel discriminative pose-free descriptor, termed the Layered Canonical Correlated (DPF-LCC) descriptor, based on canonical correlation analysis (CCA) [43].
Motivation: CCA has proved to be very effective for cross-domain or cross-modal data. CCA learns a common subspace in which the projected features from the source and target domains are maximally correlated. We can think of face images from two different poses as two different domains and apply CCA to match face images across poses. To evaluate its performance for face recognition across pose, we took images of frontal pose and of pose 04_1 (refer to Fig. 8) as source and target views, respectively, for both training and testing, with no overlap between the training and testing subjects. We see from Fig. 6 (first blue bar) that CCA performs very well for this application. A similar observation can be made if pose 05_0 is used instead of pose 04_1 (second blue bar). To evaluate how CCA performs on unseen poses on which it has not been trained, we perform another experiment where pose 04_1 is used for training and pose 05_0 for testing. We observe from Fig. 6 (third blue bar) that the performance decreases drastically. We can make the following observations from these experiments: (1) CCA performs better when the difference between source and target poses is smaller, since performance is better for pose 05_0 than for pose 04_1; and (2) CCA performs well when trained and tested on the same pose, but does not generalize well to unseen poses. Based on these observations, we propose the novel descriptor termed DPF-LCC.

Fig. 6. Motivation for deriving the proposed DPF-LCC descriptor.
Fig. 7. Detailed look of the layered structure for DPF-LCC computation.

Descriptor Computation: As in the case of DPF-SPR , the in-
put features are first projected onto all the intermediate subspaces
which are constructed by sampling on the geodesic curve between
the source and target subspaces. The resultant features are then
transformed by the discriminant metric. Hence we have a fea-
ture set corresponding to each intermediate subspace. Now, for
any two successive feature sets, we compute the projection vec-
tors for those sets using CCA, which are used to project them to
a subspace where they are maximally correlated as illustrated in
Fig. 2 (Top right). This ensures that the difference in pose between
the two subspaces used for computing CCA is small. This also en-
sures that irrespective of the actual pose of the input image, CCA
projection matrix is always applied on the pose it has been trained
on. Fig. 6 also shows the performance using the proposed descrip-
tor. We observe that for known pose (first and second red bar),
its performance is comparable to that of CCA. But for unseen pose
(third red bar), it significantly outperforms CCA.
We now describe the computation of the descriptor using a toy example illustrated in Fig. 7. Let the total number of discriminative subspaces obtained after Section 3.2 be denoted by $H$ (Layer 0). In our example, $H = 4$. Let $Y^1 = [y^1_1, y^1_2, \ldots, y^1_N] \in R^{D \times N}$ and $Y^2 = [y^2_1, y^2_2, \ldots, y^2_N] \in R^{D \times N}$ be two feature sets corresponding to two successive subspaces, $s^0_1$ and $s^0_2$ in Layer 0, where $y^i_j \in R^D$ is a transformed feature. $N$ is the total number of features (each feature comes from one training example) in that particular subspace and $i = 1, 2, \ldots, H$ for this layer. Now, CCA is performed for each pair of neighbouring subspaces, for example, $s^0_i$ and $s^0_{i+1}$. Considering $s^0_1$ and $s^0_2$, the goal is to find two projection vectors $q^1 \in R^D$ and $q^2 \in R^D$ such that the correlation coefficient $\rho \in [0, 1]$ is maximized. It is given by
$$\rho = \max_{q^1, q^2} \frac{(q^1)' \Sigma_{12} q^2}{\sqrt{(q^1)' \Sigma_{11} q^1 \, (q^2)' \Sigma_{22} q^2}} \tag{12}$$
where the within class data covariance matrices are given as $\Sigma_{11} = E[y^1 (y^1)']$ and $\Sigma_{22} = E[y^2 (y^2)']$ and the between class data covariance matrix is given as $\Sigma_{12} = E[y^1 (y^2)']$. The projection vector $q^1$ in (12) can be solved by a generalized eigenvalue decomposition problem [43] as given below
$$\Sigma_{12} (\Sigma_{22})^{-1} \Sigma_{12}' q^1 = \alpha \Sigma_{11} q^1 \tag{13}$$
where $\alpha$ is a Lagrangian multiplier. Once $q^1$ is computed, $q^2$ can be obtained using the following equation
$$q^2 = \frac{(\Sigma_{22})^{-1} \Sigma_{12} q^1}{\alpha} \tag{14}$$
In practice, to avoid over-fitting and singularity problems, regularization terms $\alpha_1$ and $\alpha_2$ are added to the covariance matrices $\Sigma_{11}$ and $\Sigma_{22}$, respectively. Therefore we actually solve the following generalized eigenvalue problems instead of (13) and (14) to get $q^1$ and $q^2$, respectively.
$$\Sigma_{12} (\Sigma_{22} + \alpha_2 I)^{-1} \Sigma_{12}' q^1 = \alpha (\Sigma_{11} + \alpha_1 I) q^1 \tag{15}$$
$$q^2 = \frac{(\Sigma_{22} + \alpha_2 I)^{-1} \Sigma_{12} q^1}{\alpha} \tag{16}$$
In this manner, the two sets of projection vectors $\{q^1_k\}_{k=1}^{n}$ and $\{q^2_k\}_{k=1}^{n}$ are computed, where $n < N$. We decide the number of projection vectors $n$ depending on the corresponding correlation coefficient $\rho$. For example, for the SCface dataset [4], we have chosen those pairs of projection vectors for which $\rho > 10^{-5}$ and found that 46 pairs of projection vectors satisfy this criterion, i.e. $n = 46$. Similarly, we find the projection vectors for the other pairs of subspaces, $\{s^0_2, s^0_3\}$ and $\{s^0_3, s^0_4\}$, of Layer 0. We take the same number of projection vectors for all the subspaces of a particular layer.
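The regularized problem in (15) and (16) for one pair of neighbouring subspaces can be sketched with NumPy alone by whitening with a Cholesky factor. This is an illustrative sketch under stated assumptions, not the authors' implementation. The features are mean-centered before the covariances are estimated. The eigenvalue of the whitened problem is treated as ρ², the standard CCA convention, which is an assumption about how α is scaled in (13)–(16). The back-substitution uses the transposed cross-covariance Σ'_12, again following the standard CCA form. The function name `regularized_cca` and the default values of `a1`, `a2` and `rho_min` are hypothetical.

```python
import numpy as np

def regularized_cca(Y1, Y2, a1=1e-3, a2=1e-3, rho_min=1e-5):
    """Y1, Y2: D x N paired feature matrices (columns are training examples)."""
    N = Y1.shape[1]
    Y1 = Y1 - Y1.mean(axis=1, keepdims=True)
    Y2 = Y2 - Y2.mean(axis=1, keepdims=True)
    S11 = Y1 @ Y1.T / N + a1 * np.eye(Y1.shape[0])   # Sigma_11 + alpha_1 I
    S22 = Y2 @ Y2.T / N + a2 * np.eye(Y2.shape[0])   # Sigma_22 + alpha_2 I
    S12 = Y1 @ Y2.T / N
    C = np.linalg.cholesky(S11)                      # S11 = C C'
    K = np.linalg.solve(C, S12)                      # C^{-1} S12
    # Symmetric form of Eq. (15): C^{-1} S12 S22^{-1} S12' C^{-T} w = lam w
    lam, W = np.linalg.eigh(K @ np.linalg.solve(S22, K.T))
    order = np.argsort(lam)[::-1]
    rho = np.sqrt(np.clip(lam[order], 0.0, None))
    keep = rho > rho_min                             # threshold on rho, as in the text
    Q1 = np.linalg.solve(C.T, W[:, order][:, keep])  # q^1 = C^{-T} w
    Q2 = np.linalg.solve(S22, S12.T @ Q1) / rho[keep]
    return Q1, Q2, rho[keep]
```

The columns of `Q1` and `Q2` are the retained projection vectors, ordered by decreasing ρ. In the layered model, the same routine would be applied to every pair of consecutive subspaces in a layer.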
Table 1
Rank-1 recognition accuracies (%) for face recognition across pose variations on the
PIE dataset [2].

Method                           c11    c29    c05    c37    Average
K-SVD [45]                       48.5   76.5   80.9   57.4   65.8
Eigen Light-field [46]           78.0   91.0   93.0   89.0   87.8
SGF [26]                         58.8   89.7   89.7   72.1   77.6
GFK [47]                         63.2   92.7   92.7   76.5   81.3
Subspace Interp. via DL [27]     76.5   98.5   98.5   88.2   90.4
Proposed Approach (DPF-SPR)      98.5   100    100    98.5   99.3
Proposed Approach (DPF-LCC)      98.5   100    100    100    99.6
The total number of subspaces decreases in each layer, starting from H in Layer 0, then (H − 1) in Layer 1, and so on, and finally one subspace in Layer (H − 1). For each layer, for every pair of consecutive subspaces, two projection matrices are computed. So if there are h subspaces in one layer, there will be (h − 1) pairs of subspaces leading to 2(h − 1) projection matrices. Thus, in the example given in Fig. 7, the total number of projection matrices will be 6, 4 and 2 for Layer 0, 1 and 2 respectively. The number of features will also change in each layer. Each subspace in Layer 0 has N features. Since subspaces $s^0_1$ and $s^0_2$ of Layer 0 have N features each, when they are projected onto the subspace $s^1_1$ of Layer 1, there will be a total of 2N features in that subspace. Likewise, a subspace corresponding to Layer m will have $2^m N$ features. The initial feature dimension D will also change in each layer, and will essentially depend on the number of projection vectors used for each projection matrix in each layer. The number of projection vectors corresponding to different layers can be different. But we have observed that if the range of $\rho$ is sufficiently large (as we have considered for the SCface experiment), the value of n remains almost the same for all the subsequent layers. The output of the training will be all the projection matrices, denoted by $Q^i_j$, which is the j-th projection matrix of the i-th layer.
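The bookkeeping described above can be checked with a short sketch. The function name and the dictionary keys are illustrative choices, not part of the original method description.

```python
def layer_summary(H, N):
    """Per-layer counts for a layered structure that starts with H subspaces.

    H: number of subspaces in Layer 0.
    N: number of features in each Layer 0 subspace.
    """
    rows = []
    for m in range(H):
        h = H - m  # subspaces remaining in Layer m
        rows.append({
            "layer": m,
            "subspaces": h,
            "projection_matrices": 2 * (h - 1),  # two per neighbouring pair
            "features_per_subspace": (2 ** m) * N,
        })
    return rows
```

For H = 4, the projection matrix counts are 6, 4 and 2 for Layers 0, 1 and 2, and 0 for the final single-subspace layer, which agrees with the Fig. 7 example quoted in the text.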
During testing, the proposed descriptor is computed for all the images in the gallery and probe, which are then compared. First, the extracted low-level features (e.g. SIFT for faces) are projected onto all the subspaces learned on the geodesic curve and transformed using the learned M. This is the feature corresponding to Layer 0 of the proposed approach. These features are projected onto Layer 1 and all the subsequent layers using the learned projection matrices. Finally we concatenate all the projected features in the final layer to generate the DPF-LCC descriptor. For our example, suppose the feature dimension of the image in Layer 0 is D. If we take n = 46 for all the layers, after projection to Layer 1, there will be two 46-dimensional vectors corresponding to each subspace. Moving ahead, after projection in Layer 2, there will be four 46-dimensional vectors corresponding to each subspace. Finally, after projection in Layer 3, there will be a total of eight vectors of dimension 46 representing the image. These eight vectors are concatenated to form the DPF-LCC descriptor of dimension 8 × 46 = 368. So the dimension of the DPF-LCC descriptor does not depend on the initial dimension of the low-level features extracted from the image, which is another advantage.
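The descriptor size in this example follows from the doubling of vectors at every layer. The sketch below assumes the same n in every layer and, for faces, the same configuration at every fiducial point; the function and parameter names are hypothetical.

```python
def dpf_lcc_dimension(num_subspaces, n, num_fiducial_points=1):
    """Dimension of the concatenated final-layer descriptor.

    num_subspaces: H, the number of subspaces in Layer 0.
    n: number of projection vectors per projection matrix (assumed equal in all layers).
    num_fiducial_points: number of local features whose descriptors are concatenated.
    """
    vectors_in_final_layer = 2 ** (num_subspaces - 1)
    return num_fiducial_points * vectors_in_final_layer * n
```

With four subspaces and n = 46 this gives 8 × 46 = 368. With six subspaces, n = 46 and 15 fiducial points it gives 22080, which agrees with the DPF-LCC dimension reported for the SCface database in Section 4.5.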
4. Experimental evaluation
In this section, we present the results of extensive experiments conducted to evaluate the effectiveness and usefulness of the proposed descriptors. In particular, experiments on face recognition across pose, face recognition across pose and resolution, and object recognition across pose are done to test the applicability of the proposed approach for these applications.
4.1. Face recognition across pose
Face images are represented by local feature descriptors (SIFT [42] in this paper) computed at 15 fiducial locations. We have used a freely available C++ software library based on active shape models known as STASM [44] to detect the fiducial locations automatically, and have also verified them manually and corrected the incorrect ones. Here, experiments on recognizing faces across pose variations on the CMU-PIE dataset [2] are presented. We follow the same protocol as in [27] and have used all the 68 subjects under 5 different poses and frontal illumination. 100 subjects from the Multi-PIE data [3], whose images have been captured under similar conditions, are used for training. Subspaces are constructed for each fiducial point separately and then concatenated to form the representation of the facial image. Furthermore, frontal and extreme poses (c11 and c37) are used for representing the entire pose space during training. We compute 12 subspaces in between pose c11 to frontal and frontal to pose c37. We also subdivide the entire pose space into 4 regions for computing the discriminative feature sets.
We consider the frontal images as the gallery and the non-frontal images under different poses as the probe. As there is no overlap between the subjects used in the training and testing set, there is no need for retraining even if the test subjects change. After the training stage, the subspaces and transformation matrices can be used for any test subject. During testing, initially both the gallery and the probe images are projected onto all the subspaces, and then their discriminative feature sets are computed using the learned metric. Finally, either the DPF-SPR or the DPF-LCC descriptor is computed. Unlike most of the methods in the literature, there is no need to learn a classifier separately for each pose, which is a major advantage of our approach.
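Once descriptors are available for the gallery and probe images, Rank-1 accuracy can be computed by nearest-neighbour matching. The sketch below is a minimal illustration; the Euclidean distance and the function name are assumptions, since the distance used for comparing descriptors is not specified in this section.

```python
import math


def rank1_accuracy(gallery, probes):
    """Percentage of probes whose nearest gallery descriptor has the same identity.

    gallery, probes: lists of (identity, descriptor) pairs, where each
    descriptor is a sequence of floats of equal length.
    Assumption: descriptors are compared with the Euclidean distance.
    """
    correct = 0
    for probe_id, probe_vec in probes:
        nearest_id, _ = min(gallery, key=lambda item: math.dist(item[1], probe_vec))
        if nearest_id == probe_id:
            correct += 1
    return 100.0 * correct / len(probes)
```

No pose-specific classifier appears anywhere in this step: the same gallery descriptors are compared against every probe, whatever its pose.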
Table 1 shows the results of the proposed approach for this experiment. We have shown comparison with several other approaches, namely (1) K-SVD [45], which learns a dictionary from the frontal images and uses the same dictionary to get the sparse coefficients for the non-frontal images; (2) SGF [26] and GFK [47], which perform subspace interpolation on the Grassmann manifold; (3) the Eigen light-field approach [46], which is designed specifically to recognize faces across pose; (4) subspace interpolation via dictionary learning [27], which interpolates subspaces by using dictionary learning to link the frontal and non-frontal domains. The recognition accuracies of all the other approaches are taken directly from [27]. Even with no separate training for each of the different probe poses, the proposed approaches perform better than all the other methods for the task of recognizing faces across pose variations.
4.2. Face recognition across pose and resolution
Here, we test the applicability of the proposed descriptors for recognizing faces across multiple variations simultaneously. For this, we perform face recognition with frontal, high-resolution (HR) images as gallery and non-frontal, low-resolution (LR) images as probe, as usually found in surveillance scenarios.
Results on Multi-PIE dataset: First we report results on the Multi-PIE dataset [3] containing images of 337 subjects from four different recording sessions captured under different poses, illumination conditions and expressions. We consider HR images under frontal pose and frontal illumination condition as gallery for our experiments. For the probe images, we use LR images taken under pose 04_1, 05_0, 13_0 and 14_0 (as named in the dataset) under all the 20 different illumination conditions and neutral expression. Fig. 8(a) shows a few sample HR gallery images and (b,c,d,e) show a few probe images in four different poses. We use HR images of size 60 × 50 and LR images of size 20 × 17 (i.e. scale factor of 3) for all the experiments. Standard bi-cubic interpolation was applied on the HR images to get the LR images. 100 randomly chosen subjects with frontal, 13_0 (left extreme) and 04_1 (right extreme)
Fig. 8. Example images from the Multi-PIE data [3]. (a) Frontal high-resolution images used as gallery; (b,c,d,e) low-resolution images under non-frontal pose (pose 13_0, 14_0, 05_0 and 04_1 as given in the dataset) used as probe images.

Table 2
Rank-1 recognition performance (%) for four different probe poses, averaged over
the different gallery illuminations on the Multi-PIE dataset [3].

Method               Pose 13_0   Pose 14_0   Pose 05_0   Pose 04_1
MDS Learning [48]    32.8        44.8        47.0        48.5
LSML [24]            46.9        53.9        55.2        54.3
GMA [49]             65.0        70.1        70.3        64.2
MvDA [50]            45.7        55.0        53.8        42.9
FCPRF + LSML [51]    54.0        71.2        73.4        61.0
SCDL [52]            66.3        73.0        72.7        64.1
CFDL [53]            65.9        72.0        72.8        64.7
SCDL + LSML          69.1        75.1        74          67.6
CFDL + LSML          68.9        74.1        74.6        68.1
Proposed DPF-SPR     74.5        78.0        74.0        70.1
Proposed DPF-LCC     75.5        78.0        78.05       74.7
Fig. 10. Example facial images of Surveillance Cameras Face Database [4] . Top
row: frontal gallery images, second row: corresponding probe images captured by
surveillance cameras.
poses are used for generating the subspaces and metric learning,
and the remaining subjects for testing. There is no overlap between
the train and test subjects and training is done only once for all
the poses. We use HR images of frontal pose and LR images of ex-
treme poses during the subspace generation. Sample images that
are used for training (marked with bounding box) and testing (all
the five poses) are shown in Fig. 9 . The parameters for the pro-
posed approach are the same as used in the PIE experiment.
The results for the proposed descriptors are reported in Table 2 .
We also compare our approach with several state-of-the-art ap-
proaches; namely (1) MDS transformation learning [48] where a
transformation between the HR frontal gallery and LR non-frontal
probe is learned; (2) metric learning approaches: large scale met-
ric learning (LSML) [24] where a metric from equivalence con-
straints based on the statistical inference perspective is learned;
(3) semi-coupled and coupled dictionary learning [52,53] where
joint dictionary learning is performed to match objects from differ-
ent domains; (4) generalized multiview analysis (GMA) [49] where
a joint, quadratic program over different feature spaces is solved
to compute a single linear subspace; (5) multiview discriminant
analysis (MvDA) [50] where a single discriminant common space
for multiple views is pursued in a non-pairwise manner by jointly
learning multiple view-specific linear transforms; (6) face image
classification by pooling raw features (FCPRF) [51] where features
are extracted by pooling local patches over a multi-dimensional pyramid.

Fig. 9. Illustration of training poses (marked with bounding box) and testing poses (all the five poses) that are used in our experiments on Multi-PIE database. Poses 14_0 and 05_0 are not used for training.

Note that we have not used the two intermediate poses (14_0 and 05_0) during training, but still we achieve good performance for probe images in these intermediate poses. In comparison, results for all the other approaches reported in Table 2 are obtained using all the poses for training. Their performance is lower when we train with only the frontal and extreme poses, as used in the proposed approach. We have provided the same input features for all the algorithms (except [51], where the algorithm itself extracts the robust features) and learned one transformation for all the probe poses. We have taken the source codes for the other approaches from the respective authors' websites. For fair comparison, we also report results of the dictionary learning approaches (SCDL and CFDL) with LSML applied on the sparse coefficients. We have also applied LSML on FCPRF features, which can add discriminability and help in improving the performance of raw features. We observe that DPF-LCC gives the highest accuracy for all four probe poses, while DPF-SPR outperforms the compared approaches for three poses and is comparable to the best of them for pose 05_0. We also observe that DPF-LCC performs at least as well as DPF-SPR for every pose.

Results on Surveillance Cameras Face Database (SCface): We also evaluate the proposed descriptors on real surveillance quality data obtained from the SCface database [4]. It contains images of 130 subjects captured in an uncontrolled environment using five different video surveillance cameras. For the gallery images, a high-quality camera was used. The same experimental setup as used in [48] is applied for our evaluation, which includes all the images from the five surveillance cameras, i.e. a total of 650 images. A few gallery (top row) and probe images (bottom row) are shown in Fig. 10.

As in [48], 50 subjects are picked randomly for training and the remaining 80 subjects are used for testing (thus there are a total of 400 probe images). There is no overlap between the train and test subjects. We have repeated the experiment 10 times with different random sampling of the subjects. The Rank-1 accuracy of the proposed approach and comparisons with several other approaches for this experiment are reported in Table 3. HR frontal images
Table 3
Rank-1 accuracy (%) of the proposed approach and comparison with state-of-the-art
approaches on the Surveillance Cameras Face Database [4]. The two columns
indicate two different training setups: using data from only one camera and from five
cameras for training respectively. The proposed approach trained using data from
just one camera performs better than all the compared approaches even when they
are trained using data from all five cameras.

Method               Rank-1 (1 Cam)   Rank-1 (5 Cam)
MDS Learning [48]    30.0             61.1
LSML [24]            64.7             67.2
GMA [49]             38.2             50.5
FCPRF + LSML [51]    58.0             61.3
SCDL [52]            48.2             58.5
CFDL [53]            45.7             62.2
SCDL + LSML          48.8             60.0
CFDL + LSML          46.3             63.3
Proposed DPF-SPR     69.0             –
Proposed DPF-LCC     72.0             –
Fig. 11. Sample images from the COIL 20 dataset [5]. The first column shows the
gallery images and the second to fifth columns show some probe images for the
same objects.
Table 4
Rank-1 accuracy (%) of the proposed approach and comparison with other
approaches on COIL 20 database [5].

Method               Rank-1 Accuracy
MDS Learning [48]    75.6
LSML [24]            80.3
GMA [49]             66.1
SCDL [52]            79.2
CFDL [53]            78.7
SCDL + LSML          82.6
CFDL + LSML          82.0
MvDA [50]            69.7
Proposed DPF-SPR     82.2
Proposed DPF-LCC     83.0
Fig. 12. Sample RGB (row 1 and 3) and the corresponding depth images (row 2 and
4) of calculator and keyboard objects from RGB-D object database [6] .
and LR non-frontal images from one camera are used for our approaches for subspace generation and transformation learning. For all the other approaches, two setups are followed for training: (a) HR frontal images and non-frontal images from one camera (same as for the proposed approach) (Table 3 second column); (b) HR frontal images and non-frontal images from all the five cameras (Table 3 third column). When only one camera is used for training, the performance of the proposed approaches is significantly better than that of the other approaches. Even though the performance of the other approaches improves by using images from all the cameras, they still perform worse than the proposed approaches. This shows that our descriptors can generalize better across unseen poses. We perform another experiment, in which we use GMA [49] in place of CCA keeping the same experimental protocol, and observe that the Rank-1 recognition rate improves to 74.5%. Thus, if CCA can be replaced by better approaches, the performance of the proposed DPF-LCC descriptor may improve further.
4.3. Object recognition across pose
Now we illustrate the applicability of the proposed descriptors to recognize general objects across variations in viewpoint. For this, we perform experiments on the Columbia object image library 20 (COIL 20) database [5]. It contains 20 objects with gray-scale images. A few sample images are shown in Fig. 11. The dataset is created in such a way that each object is captured by rotating it about its vertical axis at a regular interval of 5°. 50 images of each object that have pose variations from left extreme to right extreme, including the frontal pose, are selected for the experiments. For training we use five images per object around the frontal pose and extreme poses and the remaining images are used for testing. We have resized the images to 32 × 32 and used the image intensity values as the input features. The images are normalized against their maximum pixel value. For our experiment, a total of 12 subspaces with four regions in pose space are used.
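The input feature used for COIL 20 can be sketched as below, assuming the image has already been resized to 32 × 32. The function name is hypothetical, and the row-major flattening order and the handling of an all-zero image are assumptions, since the text does not specify them.

```python
def intensity_feature(image):
    """Flatten a gray-scale image into a feature vector scaled by its maximum pixel.

    image: list of rows, each a list of non-negative pixel intensities.
    Returns a flat list of floats in [0, 1] (row-major order, an assumption).
    """
    peak = max(max(row) for row in image)
    if peak == 0:
        # Assumption: an all-zero image maps to an all-zero feature vector.
        return [0.0 for row in image for _ in row]
    return [pixel / peak for row in image for pixel in row]
```

A 32 × 32 image therefore yields a 1024-dimensional low-level feature.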
We consider images of frontal pose as gallery data and the remaining images that differ in pose as probe data during testing. The images and the object poses used for testing are different from those used during training. Note that our experimental protocol is different from the ones normally used, so the performance cannot be directly compared with other published papers which have reported results on this dataset. The Rank-1 recognition accuracy of the proposed descriptors and comparisons with other approaches are given in Table 4. We observe that the proposed descriptors perform favourably as compared to the other approaches.
4.4. Object recognition on RGB-D object database
The RGB-D database [6] contains both RGB and depth images from 51 categories. The objects are captured in such a way that they are covered from multiple views. We take all the images (both visual and depth) of the first instance in each category for our experiments. In each category, we selected five images from four different poses for training and the rest of the images are used as probe images during testing. Sample images from the database are shown in Fig. 12.
Kernel descriptors [54] of dimension 500 are extracted separately from visual and depth images and are used as features in this experiment. The recognition experiment is conducted to recognize visual probe images against visual gallery images (Visual - Visual) and depth probe images against depth gallery images (Depth - Depth), and the Rank-1 recognition rates (%) are reported in Table 5. Comparison with other algorithms is also shown for both the cases. We observe that the proposed DPF-SPR and DPF-LCC descriptors perform favourably as compared to the other approaches, thus justifying their usefulness for the application of general object recognition.
4.5. Analysis of DPF-LCC and DPF-SPR descriptors
Here we analyze the proposed descriptors, DPF-LCC and DPF-SPR, in more detail. For this purpose, we use the SCface database [4]. The experimental setup is similar to that of [48], where we randomly pick 50 subjects for training and use the remaining 80 subjects for testing with no overlap between the train and test subjects.

Table 5
Rank-1 accuracy (%) of the proposed approach and comparison with other
approaches on RGB-D object database [6].

Method               Visual - Visual   Depth - Depth
MDS Learning [48]    82.2              53.9
LSML [24]            60.1              45.8
GMA [49]             70.6              38.9
MvDA [50]            77.2              50.6
SCDL [52]            80.4              61.1
CFDL [53]            81.0              60.5
SCDL + LSML          81.7              62.0
CFDL + LSML          82.0              61.3
Proposed DPF-SPR     86.0              62.0
Proposed DPF-LCC     84.8              63.1

Fig. 13. Rank-1 accuracy (%) vs number of subspaces on the Surveillance Cameras
Face Database.

Table 6
Number of metric learning regions vs Rank-1 accuracy (%).

Method     Number of metric learning regions
           1      2      3      4      5      6
DPF-SPR    65.0   65.3   69.0   64.5   64.5   64.3
DPF-LCC    70.8   72.0   72.0   69.8   69.8   70.0

Fig. 14. Size of the descriptors vs number of subspaces.
Fig. 15. Time required (seconds) vs number of subspaces.
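The subject-disjoint protocol used for SCface (50 training and 80 test subjects out of 130, repeated over several random draws) can be sketched as follows. The function names and the use of consecutive integer seeds are illustrative assumptions.

```python
import random


def split_subjects(subject_ids, num_train, seed):
    """Randomly split subjects into disjoint training and test lists."""
    rng = random.Random(seed)
    train = set(rng.sample(subject_ids, num_train))
    test = [s for s in subject_ids if s not in train]
    return sorted(train), test


def repeated_splits(subject_ids, num_train, num_trials):
    """One independent split per trial (seeds 0, 1, ... are an assumption)."""
    return [split_subjects(subject_ids, num_train, seed) for seed in range(num_trials)]
```

With 130 subjects and five surveillance cameras, each split leaves 80 test subjects and hence 80 × 5 = 400 probe images, as stated in Section 4.2.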
First, we analyze the effect of the number of subspaces on the
Rank-1 accuracy. This result is shown in Fig. 13 . We observe that
for a wide range of the number of subspaces, the performance of
the two descriptors does not vary widely.
Now, we analyze the effect of the number of metric learning regions used for learning the discriminative features on the recognition accuracy. We use six intermediate subspaces in our experiments. Table 6 shows the Rank-1 accuracy (%) with different numbers of metric learning regions. We see that the performance varies little with a small change in the number of regions. We also observe that when the number of regions is very small, the performance of both the proposed descriptors is slightly lower. This can potentially be attributed to the larger pose differences among the images within each region. Also, if we increase the number of regions, the accuracy increases at first and then starts decreasing. This is because when the number of regions is high, the number of match and non-match pairs available for learning the discriminative metric in each region is relatively small.
Now, we analyse the feature dimension and the computational requirements of the two proposed descriptors. The dimensions of the two descriptors, DPF-LCC and DPF-SPR, are functions of the number of intermediate subspaces used. Fig. 14 shows the variation in the feature dimension of DPF-SPR and DPF-LCC with different numbers of intermediate subspaces for the SCface database. For this dataset, we have taken the number of subspaces as six, and thus the feature dimensions for the DPF-SPR and DPF-LCC descriptors are 371520 and 22080 respectively. Here, we represent each facial image as the concatenation of features computed from each of the 15 fiducial points. We observe that the dimension of both the descriptors increases with the number of subspaces. But the feature dimension of DPF-LCC is much smaller than that of the DPF-SPR descriptor for the entire range.
The time required to compute the distance between two descriptors is a function of their dimension. We have already observed that the dimension of DPF-LCC is considerably smaller than that of DPF-SPR. So it can be expected that it will take less time to compute the distance between two DPF-LCC descriptors than between two DPF-SPR descriptors. Fig. 15 shows the plot of time required for pairwise comparison (in seconds) against the number of subspaces. Since the dimension of both the descriptors increases with the number of subspaces, the time requirement also increases. But we observe that the time required for DPF-LCC is much less than that required for DPF-SPR. For the SCface database, there are 80 gallery images during testing, so the time required to get the identity of one probe image is around 0.7 s for DPF-LCC, in comparison to around 15 s for DPF-SPR. This difference will increase as the size of the gallery increases. We have mentioned
Fig. 16. Rank-1 accuracy as a function of number of projection vectors in each pro-
jection matrix.
Table 7
Rank-1 recognition accuracy (%) for four different probe poses, averaged over the
different gallery illuminations on the Multi-PIE dataset [3] using VGG Features [55].

Method                 Pose 13_0   Pose 14_0   Pose 05_0   Pose 04_1
VGG-HR-LR-NN           32.2        52.8        53.1        32.8
VGG-HR-LR-Proposed     39.7        55.9        57.0        40.5
VGG-HR-HR-NN           88.3        97.0        97.0        91.3
VGG-HR-HR-Proposed     92.6        98.0        98.2        94.3

Table 8
Rank-1 accuracy (%) of the proposed approach on RGB-D database with AlexNet
deep features [56].

Method               Rank-1 accuracy
AlexNet-NN           90.2
AlexNet-Proposed     91.4
earlier that after the computation of the discriminative descriptor, the feature sets can be compared using pairwise comparisons. But this approach is computationally expensive, and this has motivated the development of the two proposed descriptors. So we also plot the time required to compute the distance between two images by pairwise comparison in Fig. 15. We see that both the proposed descriptors require much less time than the pairwise comparison approach, especially as the number of subspaces increases. The Rank-1 accuracy for this dataset with pairwise comparison is around 68.75%, which is slightly less than that obtained using the two descriptors.
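The effect of gallery size on identification time can be illustrated with a rough calculation. It assumes that a probe is compared with every gallery descriptor in turn, so that the time per probe scales linearly with the gallery size; the function names are hypothetical and the extrapolated values are not measurements.

```python
def per_comparison_time(time_per_probe_s, gallery_size):
    """Average time of one probe-gallery comparison, assuming a linear scan."""
    return time_per_probe_s / gallery_size


def projected_time_per_probe(time_per_probe_s, gallery_size, new_gallery_size):
    """Extrapolated time per probe for a different gallery size (linear assumption)."""
    return per_comparison_time(time_per_probe_s, gallery_size) * new_gallery_size
```

Under this assumption, the reported 0.7 s and 15 s per probe for a gallery of 80 images correspond to roughly 0.009 s and 0.19 s per comparison for DPF-LCC and DPF-SPR respectively, so the absolute gap between the two grows in proportion to the gallery size.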
Lastly, we analyse the performance of DPF-LCC with different values of n, where n is the number of projection vectors in each projection matrix as described in Section 3.4. As mentioned earlier, n can be different for the different layers, but we have taken the same value since we have observed that it does not vary much with the layers. We observe from Fig. 16 that the Rank-1 accuracy initially increases with increasing n, reaches a maximum and then saturates. The value of n which gives the maximum accuracy is quite small, which is one of the main reasons behind the relatively small dimension of the DPF-LCC descriptor.
4.6. Analysis with state-of-the-art deep features
Recently deep learning techniques have become very popular in computer vision and have produced state-of-the-art results for many different applications. In this section, we compare the proposed approach with some of the recent deep learning methods and also show how the proposed descriptors can be used to further boost the performance of features obtained from deep neural networks. For this purpose, we perform experiments on the Multi-PIE dataset [3] for faces and the RGB-D database [6] for objects. The setup for these experiments is the same as that reported in the experiments section. For the Multi-PIE dataset [3], we use a recent deep learning architecture, VGGNet [55], which has been specifically trained with faces for the application of face recognition. The output of the fully connected layer labeled as FC6 in the original VGGNet model, whose dimension is 4096 × 1, is used as features in this experiment. Note that we use the existing network parameters without any retraining. Using the HR images as gallery and the LR images as probe (i.e. a similar protocol as used in our previous experiments) and a nearest neighbour classifier, the Rank-1 accuracy (%) is reported in Table 7, denoted as VGG-HR-LR-NN. Since the VGGNet is not trained on low resolution images, the Rank-1 recognition accuracy is quite low, as expected. We then compute the proposed DPF-LCC descriptor using the VGGNet [55] output as the low-level features and the performance is reported as VGG-HR-LR-Proposed. We see that though the performance is still low, it is better than that obtained using the VGGNet features directly. Since the VGGNet is trained on HR images, we also perform another experiment with HR images as both gallery and probe. The results in this case for the nearest neighbour matching of the VGGNet features and for the proposed descriptor on the VGGNet features are given in the third and fourth rows of Table 7 respectively. As expected, the performance using VGGNet features directly is very good. We also observe that in this case too, the proposed descriptor is able to further improve the performance, thus justifying its usefulness with different kinds of input features.
For the application of object recognition, we perform similar experiments with the very popular deep learning architecture AlexNet [56], which is pretrained on object images from ImageNet. Here, we take the output of the first fully connected layer of the network as the low-level feature. First, we compute the accuracy when the AlexNet features are used and a nearest neighbour classifier is used to compute the probe identity. This result is given in Table 8. We also report the results using AlexNet along with the proposed descriptor, termed AlexNet-Proposed. We observe that the proposed descriptor performs slightly better than the low-level features. For all these comparisons, we have used the DPF-LCC and not the DPF-SPR descriptor due to the high dimensionality of the latter. Note that both these deep networks have been trained on millions of images, and so improvement over these features using the proposed descriptors justifies the usefulness of the proposed approach.
5. Discussion
In this work, we proposed two novel discriminative pose-free descriptors (DPF-SPR and DPF-LCC) for matching faces and objects across pose. The proposed approaches require images from a few regions of the pose space for training and do not require separate training for each probe pose. Experimental evaluations for various tasks like face recognition across pose, face recognition across resolution and pose, and object recognition across different viewpoints are conducted to evaluate the usefulness and generalizability of the approach. The superior performance of the proposed descriptors as compared to the state-of-the-art approaches shows the effectiveness of the proposed approach.
eferences
[1] S. Sanyal , S.P. Mudunuri , S. Biswas , Discriminative pose-free descriptors for face
and object matching, Int. Conf. Comput. Vision (2015) 3837–3845 . [2] T. Sim , S. Baker , M. Bsat , The cmu pose, illumination and expression database,
IEEE Trans. Pattern Anal. Mach. Intell. 25 (12) (2003) 1615–1618 . [3] R. Gross , I. Matthews , J. Cohn , T. Kanade , S. Baker , Guide to the cmu Multi-Pie
Database, Technical report, Carnegie Mellon University, 2007 .
364 S. Sanyal et al. / Pattern Recognition 67 (2017) 353–365
[
[4] M. Grgic , K. Delac , S. Grgic , Scface–surveillance cameras face database, Mul-timed. Tools Appl. 51 (3) (2011) 863–879 .
[5] S.A. Nene , S.K. Nayar , H. Murase , Columbia object image library (coil-20), Tech-nical report, 1996 .
[6] K. Lai , L. Bo , X. Ren , D. Fox , A large-scale hierarchical multi-view rgb-d objectdataset, Int. Conf. Rob. Autom. (2011) 1817–1824 .
[7] S. Li , X. Liu , X. Chai , H. Zhang , S. Lao , S. Shan , Maximal likelihood correspon-dence estimation for face recognition across pose, IEEE Trans. Image Process.
23 (10) (2014) 4587–4600 .
[8] C. Ding , C. Xu , D. Tao , Multi-task pose-invariant face recognition, IEEE Trans.Image Process. 24 (3) (2015) 980–993 .
[9] C. Ding , J. Choi , D. Tao , L.S. Davis , Multi-directional multi-level dual-cross pat-terns for robust face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 38 (3)
(2016) 518–531 . [10] X. Zhang , D.S. Pham , S. Venkatesh , W. Liu , D. Phung , Mixed-norm sparse rep-
resentation for multi view face recognition, Pattern Recognit. 48 (9) (2015)
2935–2946 . [11] C.D. Castillo , D.W. Jacobs , Using stereo matching with general epipolar geom-
etry for 2d face recognition across pose, IEEE Trans. Pattern Anal. Mach. Intell.31 (12) (2009) 2298–2304 .
[12] D. Wang , H. Lu , M.H. Yang , Kernel collaborative face recognition, PatternRecognit. 48 (10) (2015) 3025–3037 .
[13] L.A. Cament , F.J. Galdames , K.W. Bowyer , C.A. Perez , Face recognition under
pose variation with local gabor features enhanced by active shape and statis-tical models, Pattern Recognit. 48 (11) (2015) 3371–3384 .
[14] H. Li , G. Hua , Z. Lin , J. Brandt , J. Yang , Probabilistic elastic matching for posevariant face verification, Comput. Vision Pattern Recognit. (2013) 3499–3506 .
[15] Q. Yin , X. Tang , J. Sun , An associate-predict model for face recognition, Comput.Vision Pattern Recognit. (2011) 497–504 .
[16] O. Arandjelovic , Learnt quasi-transitive similarity for retrieval from large col-
lections of faces, Comput. Vision Pattern Recognit. (2016) 4 883–4 892 . [17] P.H. Hennings-Yeomans , S. Baker , B.V. Kumar , Simultaneous super-resolution
and feature extraction for recognition of low-resolution faces, Comput. VisionPattern Recognit. (2008) 1–8 .
[18] B. Li , H. Chang , S. Shan , X. Chen , Low-resolution face recognition via coupledlocality preserving mappings, Comput. Vision Pattern Recognit. (2010) 20–23 .
[19] H.S. Bhatt, R. Singh, M. Vatsa, N.K. Ratha, Improving cross-resolution face matching using ensemble-based co-transfer learning, IEEE Trans. Image Process. 23 (12) (2014) 5654–5669.
[20] C. Ren, D. Dai, K. Huang, Z. Lai, Transfer learning of structured representation for face recognition, IEEE Trans. Image Process. 23 (12) (2014) 5440–5454.
[21] S. Al-Maadeed, M. Bourif, A. Bouridane, R. Jiang, Low-quality facial biometric verification via dictionary-based random pooling, Pattern Recognit. 52 (2016) 238–248.
[22] Z.Q. Zhao, Y.M. Cheung, H. Hu, X. Wu, Corrupted and occluded face recognition via cooperative sparse representation, Pattern Recognit. 56 (2016) 77–87.
[23] W.W.W. Zou, P.C. Yuen, Very low resolution face recognition problem, IEEE Trans. Image Process. 21 (1) (2012) 327–340.
[24] M. Kostinger, M. Hirzer, P. Wohlhart, P. Roth, H. Bischof, Large scale metric learning from equivalence constraints, Comput. Vision Pattern Recognit. (2012) 2228–2295.
[25] P. Moutafis, I.A. Kakadiaris, Semi-coupled basis and distance metric learning for cross-domain matching: application to low-resolution face recognition, Int. Joint Conf. Biometrics (2014) 1–8.
[26] R. Gopalan, R. Li, R. Chellappa, Unsupervised adaptation across domain shifts by generating intermediate data representations, IEEE Trans. Pattern Anal. Mach. Intell. 36 (11) (2014) 2288–2302.
[27] J. Ni, Q. Qiu, R. Chellappa, Subspace interpolation via dictionary learning for unsupervised domain adaptation, Comput. Vision Pattern Recognit. (2013) 692–699.
[28] Y. Sun, Y. Chen, X. Wang, X. Tang, Deep learning face representation by joint identification-verification, Neural Inf. Process. Syst. (2014) 1988–1996.
[29] Y. Taigman, M. Yang, M.A. Ranzato, L. Wolf, Deepface: closing the gap to human-level performance in face verification, Comput. Vision Pattern Recognit. (2014) 1701–1708.
[30] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: a unified embedding for face recognition and clustering, Comput. Vision Pattern Recognit. (2015) 815–823.
[31] T.H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, Y. Ma, Pcanet: a simple deep learning baseline for image classification? IEEE Trans. Image Process. 24 (12) (2015) 5017–5032.
[32] Z. Wang, S. Chang, Y. Yang, D. Liu, T. Huang, Studying very low resolution recognition using deep networks, arXiv preprint arXiv:1601.04153 (2016).
[33] J. Chen, J. Wu, J. Konrad, P. Ishwar, Semi-coupled two-stream fusion convnets for action recognition at extremely low resolutions, arXiv preprint arXiv:1610.03898 (2016).
[34] J. Schels, J. Liebelt, R. Lienhart, Learning an object class representation on a continuous viewsphere, Comput. Vision Pattern Recognit. (2012) 3170–3177.
[35] E. Hsiao, M. Hebert, Occlusion reasoning for object detection under arbitrary viewpoint, IEEE Trans. Pattern Anal. Mach. Intell. 36 (9) (2014) 1803–1815.
[36] J.C. Rubio, A. Eigenstetter, B. Ommer, Generative regularization with latent topics for discriminative object recognition, Pattern Recognit. 48 (12) (2015) 3871–3880.
[37] M. Wu, J. Zhou, J. Sun, Query-expanded collaborative representation based classification with class-specific prototypes for object recognition, Pattern Recognit. 47 (11) (2014) 3585–3596.
[38] A. Bakry, A. Elgammal, Untangling object-view manifold for multiview recognition and pose estimation, Eur. Conf. Comput. Vision (2014) 434–449.
[39] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1904–1916.
[40] T. Hassner, V. Mayzels, L.Z. Manor, On sifts and their scales, Comput. Vision Pattern Recognit. (2012) 808–821.
[41] K. Gallivan, A. Srivastava, X. Liu, P.V. Dooren, Efficient algorithms for inferences on grassmann manifolds, IEEE Workshop Stat. Signal Process. (2003) 315–318.
[42] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.
[43] D.R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Comput. 16 (12) (2004) 2639–2664.
[44] S. Milborrow, F. Nicolls, Locating facial features with an extended active shape model, Eur. Conf. Comput. Vision (2008). http://www.milbo.users.sonic.net/stasm.
[45] M. Aharon, M. Elad, A. Bruckstein, K-Svd: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311–4322.
[46] R. Gross, I. Matthews, S. Baker, Appearance-based face recognition and light-fields, IEEE Trans. Pattern Anal. Mach. Intell. 26 (4) (2004) 449–465.
[47] B. Gong, Y. Shi, F. Sha, K. Grauman, Geodesic flow kernel for unsupervised domain adaptation, Comput. Vision Pattern Recognit. (2012) 2066–2073.
[48] S. Biswas, G. Aggarwal, P.J. Flynn, K.W. Bowyer, Pose-robust recognition of low-resolution face images, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 3037–3049.
[49] A. Sharma, A. Kumar, H. Daume, D. Jacobs, Generalized multiview analysis: a discriminative latent space, Int. Conf. Comput. Vision (2012) 2160–2167.
[50] M. Kan, S. Shan, H. Zhang, S. Lao, X. Chen, Multi-view discriminant analysis, IEEE Trans. Pattern Anal. Mach. Intell. 38 (1) (2016) 188–194.
[51] F. Shen, C. Shen, X. Zhou, Y. Yang, H.T. Shen, Face image classification by pooling raw features, Pattern Recognit. 54 (2016) 94–103.
[52] S. Wang, D. Zhang, Y. Liang, Q. Pan, Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis, Comput. Vision Pattern Recognit. (2012) 2216–2223.
[53] D.A. Huang, Y.C.F. Wang, Coupled dictionary and feature space learning with applications to cross-domain image synthesis and recognition, Int. Conf. Comput. Vision (2013) 2496–2503.
[54] L. Bo, X. Ren, D. Fox, Kernel descriptors for visual recognition, in: Advances in Neural Information Processing Systems, 2010, pp. 244–252.
[55] O.M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, Br. Mach. Vision Conf. (2015) 1–6.
[56] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Neural Inf. Process. Syst. (2012) 1097–1105.
Soubhik Sanyal received B.E. degree from Jadavpur University in 2013. He is currently working toward the M.Sc (Engg.) degree in Electrical Engineering in Indian Institute of Science, Bangalore, India. He is a student member of IEEE and IEEE Signal Processing Society. His research interests include computer vision, machine learning and deep learning.

Sivaram Prasad Mudunuri received the B.Tech. degree in Electronics and Communications Engineering from Sasi Institute of Technology and Engineering, India, in 2009, and M.Tech. degree from the National Institute of Technology, Calicut, in 2011. He is currently a doctoral student in the Department of Electrical Engineering at the Indian Institute of Science, Bangalore, India. He is a Student member of IEEE. His research interests are in image processing, computer vision and pattern recognition.

Soma Biswas is an assistant professor in the Electrical Engineering Department, Indian Institute of Science, Bangalore, India. She received the MTech degree from the Indian Institute of Technology, Kanpur, in 2004, and PhD degree in Electrical and Computer engineering from University of Maryland, College Park, in 2009. Her research interests include image processing, computer vision, and pattern recognition. She is a senior member of the IEEE.