
Pattern Recognition 67 (2017) 353–365


Discriminative pose-free descriptors for face and object matching

Soubhik Sanyal, Sivaram Prasad Mudunuri, Soma Biswas ∗

Department of Electrical Engineering, Indian Institute of Science, Bangalore, India

Article info

Article history:

Received 25 July 2016

Revised 25 December 2016

Accepted 10 February 2017

Available online 17 February 2017

Keywords:

Face recognition

Object recognition

Pose invariant matching

Metric learning

Canonical correlation

Subspace to point representation.

Abstract

Pose invariant matching is an important problem with many applications, such as recognizing faces in uncontrolled scenarios in which the facial images appear in a wide variety of pose and illumination conditions along with low resolution. Here we propose two discriminative pose-free descriptors, the Subspace Point Representation (DPF-SPR) descriptor and the Layered Canonical Correlated (DPF-LCC) descriptor, for matching faces and objects across pose. Training examples at very few poses are used to generate virtual intermediate pose subspaces. An image is represented by a feature set obtained by projecting its low-level feature onto these subspaces, and a discriminative transform is applied to make this feature set suitable for recognition. We represent this discriminative feature set by two novel descriptors. In one approach, we transform it to a vector by using a subspace to point representation technique. In the second approach, a layered structure of canonical correlated subspaces is formed, onto which the feature set is projected. Experiments on recognizing faces and objects across pose and comparisons with state-of-the-art approaches show the effectiveness of the proposed approach.

© 2017 Elsevier Ltd. All rights reserved.


1. Introduction

Matching faces (or objects) across a wide variety of poses is a very important area of research in the field of computer vision with many applications. For example, in a surveillance setting, the face of a person captured by the overhead cameras may be in any arbitrary pose and at poor resolution, as opposed to the frontal, high-resolution image that is typically captured during enrolment (Fig. 1, column 1 and 2). For object matching, the images captured during testing can be taken from a different viewpoint compared to the images stored in the database, which again requires comparing objects in different poses (Fig. 1, column 3–6). The aforesaid tasks are challenging because the appearance of the images to be matched can be very different due to significant pose variations.

∗ Corresponding author.

E-mail address: [email protected] (S. Biswas).

http://dx.doi.org/10.1016/j.patcog.2017.02.016

0031-3203/© 2017 Elsevier Ltd. All rights reserved.

In this paper, we propose two discriminative pose-free descriptors, the Subspace Point Representation (DPF-SPR) descriptor (which is also termed DPFD in [1]) and the Layered Canonical Correlated (DPF-LCC) descriptor, for matching faces and objects across different poses. During the training phase, images from a few poses (two to three) are used to generate virtual subspaces for the intermediate poses. We generate the virtual intermediate subspaces by treating the subspaces generated by the training data as points on the Grassmann manifold and sampling the shortest geodesic path between those points. Then, we represent an image (or image region, depending on the application) by a set of features, computed by projecting its low-level feature vector onto all the intermediate subspaces, which ensures that one or more of the features from the entire feature set will match when images with different poses are compared. Since our final goal is recognition, we use a discriminative transform, learned using the class labels of the training data, to transform the feature set. Then the DPF-SPR or DPF-LCC descriptor is computed from the feature set, which can be directly used for matching. In this paper, our focus is on the following challenging tasks:

1. Unconstrained face recognition, where the gallery consists of

frontal images captured during enrolment and the probe im-

ages can be in any arbitrary pose. We also address the problem

where, in addition to non-frontal pose, the probe images also

have low-resolution as is usually the case in surveillance set-

ting when the images are taken from a large distance from the

subject. We perform extensive experiments on the CMU PIE [2] ,

Multi-PIE [3] and the SCface database [4] .

2. Object recognition across pose, where the objects of different

poses are to be matched. For this purpose we evaluate the pro-

posed approach on COIL-20 [5] and RGB-D object datasets [6] .

We also consider the task of matching depth images of ob-

jects across pose to test the generalizability of the proposed

approach.

We compare the proposed approach with state-of-the-art metric learning, cross-pose, domain adaptation and coupled dictionary learning approaches to show the effectiveness of the two descriptors. The novelty/contribution of the work is as follows:


Fig. 1. Applications requiring matching across pose variations. Column 1,2: Face

recognition in uncontrolled setting; and Column 3–6: Object recognition across

viewpoint.


• Two novel discriminative pose-free descriptors, the Subspace Point Representation (DPF-SPR) descriptor and the Layered Canonical Correlated (DPF-LCC) descriptor, for matching faces and objects across different views are proposed.

• The approach does not require separate training for different probe poses/viewpoints. This is an advantage over many other approaches which work well when separate training is performed for different poses encountered during testing.

• Very few poses (as few as two or three) are required during the training phase and the method can generalize well to unseen poses.

• Extensive experiments illustrate the applicability of the proposed approach in diverse domains like face and object recognition.

A preliminary version of this work appeared in [1] . The rest of

the paper is organized as follows. Section 2 describes the related

literature. Details of the proposed approach are given in Section 3 .

The experimental results and analysis of the descriptors are re-

ported in Section 4 . The paper concludes with a brief discussion

section.

2. Related work

In this section, we provide pointers to some of the related work

in the area of recognizing faces and objects across pose. Recog-

nizing faces across pose is an important research area. Li et al.

[7] propose maximal likelihood correspondence estimation after

encoding face specific structure information of semantic corre-

spondence. Ding et al. [8] learn a transformation dictionary which

transforms the features of different poses into a discriminative

subspace where face matching is performed at patch level rather

than at holistic level. Ding et al. [9] extract multi-directional multi-

level dual cross patterns as pose invariant face descriptor. Zhang

et al. [10] propose a mixed norm approach which is achieved by

a trade-off between sparse representation classification and joint

sparse representation classification. Castillo and Jacobs [11] pro-

pose a method to compute stereo matching cost between two fa-

cial images by using epipolar geometry. Wang et al. [12] formulate

a general representation of kernel collaborative framework and de-

velop an l 2 regularized algorithm within it. Cament et al. [13] ex-

tract Gabor features from modified grids using mesh to model face

deformations produced by varying pose. Li et al. [14] train a Gaus-

sian mixture model to capture the spatial appearance distribution

of all face images in the training corpus. Yin et al. [15] design a

model for recognizing faces under large variations which can pre-

dict the appearance and likelihood of the given query face against

the collected generic identities. Arandjelovic et al. [16] propose a

framework that can improve the performance of any baseline face

retrieval algorithm by leveraging the structure of the database.

Recently, matching of low resolution facial images has gained considerable attention [17,18]. Bhatt et al. [19] propose using a combination of transfer learning and co-training paradigms for cross resolution matching. Ren et al. [20] propose a method of extracting and representing discriminant features from faces and then alternately optimizing across different data domains. Al-Maadeed et al. [21] learn a pairwise dictionary and utilize a random pooling strategy to select a subset of visual words. Zhao et al. [22] combine forward and backward sparse representation for robust face recognition. A piecewise linear regression model is developed to learn the relationship between the high resolution (HR) image space and the low resolution (LR) image space for face super resolution in [23]. Metric learning approaches have shown a lot of promise for matching faces in unconstrained environments. Kostinger et al. [24] propose a method that learns a distance metric from the covariance matrices of similar and dissimilar pairs. Moutafis and Kakadiaris [25] propose an algorithm that can match HR and LR facial images by learning individual bases for optimal representation and coupled distance metrics to enhance the classification. Domain adaptation techniques have also been successfully used for matching face images across pose, illumination, blur, etc. [26]. In [27], dictionary learning is used to interpolate subspaces to link the source and target domains. The main drawback of most of these approaches is that they perform well only if the test faces have the same pose as the faces used for training. This is the key motivation in developing our algorithm, so that the training can be done using a few representative poses, and the testing image can have a different pose than those used for training.

There has been a lot of progress in the field of deep learning in face and object recognition tasks [28] in recent times. Taigman et al. [29] apply a piecewise affine transformation for 3D modeling of faces using deep convolutional networks. Schroff et al. [30] propose to learn a deep convolutional network by mapping from face images to a Euclidean space. Chan et al. [31] build a deep learning network with the help of cascaded principal component analysis, binary hashing and block-wise histograms. Though the deep learning methods have shown better performance under a wide range of pose variations, the performance degrades when the resolution is low [32,33]. Handling the low resolution problem with deep learning approaches requires a sufficient amount of training samples captured under poor resolution conditions for learning the model.

There has also been a lot of research in the area of general object recognition across different viewpoints [34]. Hsiao and Hebert [35] model occlusions by reasoning about 3D interactions of objects. Rubio et al. [36] use generative non-negative matrix factorization to find relevant parts for training instances. Wu et al. [37] propose a query-expanded collaborative representation based classifier with class-specific prototypes. A model that separates a view-invariant category representation from a category-invariant pose representation is proposed in [38]. He et al. [39] use a spatial pyramid pooling strategy to eliminate the need for a fixed size input image for a convolutional neural network for object recognition.

One of the proposed descriptors is inspired by [40], in which images are matched across varying scales. Features at different scales can be computed from the same image itself, unlike features at different poses, which is the focus of our work. Generating intermediate subspaces by sampling the Grassmann manifold has also been exploited by [26], where the projections on these subspaces are used to train discriminative classifiers for each object. Instead, using the intermediate subspaces, we form a discriminative feature vector which can directly be used for matching. Our approach is more suitable for applications like face recognition, where there may not be any overlap between the training and testing subjects.

3. Proposed approach

In this section, we describe in detail the computation of the two discriminative pose-free descriptors, namely, DPF-SPR and DPF-LCC. The different steps required during training are:


Fig. 2. Flow chart of the proposed framework showing the training stage and construction of the two descriptors: DPF-SPR and DPF-LCC .


1. First, the virtual intermediate subspaces are generated from the

given training examples of a few pose regions. The feature vec-

tor from the input image is then projected onto all the sub-

spaces to form a feature set.

2. Then, from the training class labels, a discriminative transform

is learned. This completes the training stage for DPF-SPR .

3. For DPF-LCC computation, after learning the transformation ma-

trices, a layered model of correlated subspaces and the corre-

sponding projection matrices are learned.

During testing, after computing the feature set for a given image by projecting onto all the subspaces, the features are transformed using the discriminative transform learned in the training phase. Then, the subspace to point representation technique can be used to construct the DPF-SPR descriptor for the image. For computing DPF-LCC, the discriminative feature set is projected onto the canonical correlated subspaces learned during training to compute the descriptor. The flowcharts of the proposed approaches are provided in Fig. 2. Top left: portion of the training stage common to both the descriptors. Top right: training stage of DPF-LCC, which is continued from the left side. Bottom: testing stage showing the computation of DPF-SPR and DPF-LCC. We describe each of the steps in detail in the following subsections.

3.1. Feature representation using intermediate subspaces

Suppose we have training images from some parts of the pose space, say, $P_1, P_2$ to $P_K$ ($K$ is as small as two or three) (Fig. 3). Our aim is to generate a descriptor for an image with any unknown pose so that it can be used for matching across poses. Assume that $f$ is the low level feature descriptor of an image. We propose to represent the actual image using a collection of features $\{f_1, f_2, \ldots\}$ instead of a single $f$ because our goal is to match it in any pose. The collection of features $\{f_1, f_2, \ldots\}$ are the feature vectors that would be computed if we had that image at different poses. The chances of matching two images of the same object which differ only in pose are higher if we compare the feature sets from the two images, rather than only using $f$.

We compute virtual poses by learning the path between $P_k$ and $P_{k+1}$ in order to generate the features at different poses. For this purpose, we exploit the idea of sampling on the Grassmann manifold [26,41]. Suppose $N_{trn}$ is the number of training images in pose $P_k$ as well as in pose $P_{k+1}$. We denote the low level features corresponding to the images in $P_k$ as $f_{k,i} \in \mathbb{R}^{D}$, where $i = 1, 2, \ldots, N_{trn}$. Thus, we have a data matrix of dimension $D \times N_{trn}$ for pose $P_k$ as well as $P_{k+1}$. Now we obtain the generative subspaces $P_k$ and $P_{k+1} \in \mathbb{R}^{D \times d}$ by applying principal component analysis (PCA) on the data matrices of poses $P_k$ and $P_{k+1}$ respectively. The space of $d$-dimensional subspaces in $\mathbb{R}^{D}$ can be identified with the Grassmann manifold $G_{d,D}$ and thus, $P_k$ and $P_{k+1}$ are points on $G_{d,D}$. We aim to generate the virtual features corresponding to the intermediate subspaces between $P_k$ and $P_{k+1}$.

Let $P_k$ have an orthogonal complement $R_k \in \mathbb{R}^{D \times (D-d)}$ such that $R_k^{T} P_k = 0$. Then the geodesic flow $\Psi(t)$, $t \in [0, 1]$, between $P_k$ and $P_{k+1}$ is such that $\Psi(t) \in G_{d,D}$, $\Psi(0) = P_k$ and $\Psi(1) = P_{k+1}$. This implies that starting from $P_k$, the geodesic flow reaches $P_{k+1}$ in unit time. The expression for the flow at any time $t$ is given by

$$\Psi(t) = P_k U_1 \Gamma(t) - R_k U_2 \Sigma(t) \tag{1}$$

where $U_1 \in \mathbb{R}^{d \times d}$ and $U_2 \in \mathbb{R}^{(D-d) \times d}$ are orthonormal matrices. $U_1$ and $U_2$ can be obtained using the following equations

$$P_k' P_{k+1} = U_1 \Gamma V' \tag{2}$$

$$R_k' P_{k+1} = -U_2 \Sigma V' \tag{3}$$

where $\Gamma, \Sigma \in \mathbb{R}^{d \times d}$ are diagonal matrices whose diagonal elements are $\cos\theta_i$ and $\sin\theta_i$ for $i = 1, 2, \ldots, d$, and $\{\cdot\}'$ denotes the transpose operator. $\theta_i$ are known as the principal angles between $P_k$ and $P_{k+1}$. Correspondingly, $\Gamma(t)$ and $\Sigma(t)$ are the diagonal matrices with elements $\cos(t\theta_i)$ and $\sin(t\theta_i)$. By using different values of $t$, we can obtain different intermediate subspaces.
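The sampling of Eqs. (1)-(3) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names `pca_basis` and `geodesic_subspaces` are hypothetical, the inputs are assumed to have orthonormal columns, and the orthogonal complement $R_k$ is never formed explicitly (the sketch uses the identity $R_k R_k' = I - P_k P_k'$ instead, which is an implementation choice).

```python
import numpy as np

def pca_basis(X, d):
    """Orthonormal D x d basis of the top-d principal directions of a D x N data matrix."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :d]

def geodesic_subspaces(Pk, Pk1, ts):
    """Sample the geodesic flow of Eq. (1) between span(Pk) and span(Pk1).

    Pk, Pk1: D x d matrices with orthonormal columns. ts: values of t in [0, 1].
    Returns one D x d orthonormal basis per t.
    """
    # Eq. (2): Pk' Pk1 = U1 Gamma V', the singular values are cos(theta_i).
    U1, cos_theta, Vt = np.linalg.svd(Pk.T @ Pk1)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    sin_theta = np.sin(theta)
    # Eq. (3) without forming Rk: (I - Pk Pk') Pk1 V = Rk Rk' Pk1 V = -Rk U2 Sigma.
    ortho = (Pk1 - Pk @ (Pk.T @ Pk1)) @ Vt.T
    # Divide each column by sin(theta_i) to get -Rk U2; zero angles contribute nothing.
    moving = sin_theta > 1e-12
    Q = np.zeros_like(ortho)
    Q[:, moving] = ortho[:, moving] / sin_theta[moving]
    return [Pk @ U1 @ np.diag(np.cos(t * theta)) + Q @ np.diag(np.sin(t * theta))
            for t in ts]
```

Calling `geodesic_subspaces(Pk, Pk1, np.linspace(0, 1, 5))` would return five bases whose spans move from that of `Pk` (at t = 0) to that of `Pk1` (at t = 1); the number of samples is a free choice here.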

This work is motivated by the claim that if we project an image of any unknown pose onto any interpolated subspace, the reconstructed image obtained will have a pose similar to the interpolated pose. To show this, we have projected images of frontal pose and 30° pose (Fig. 4(a)) onto an interpolated pose. We observe that the reconstructed images (Fig. 4(b)) are close to the actual 15° pose (Fig. 4(c)) in both cases, thus justifying the subspace interpolation.

Fig. 3. Top: Training images from 3 different parts of the pose space (left pose, frontal, right pose) denoted by $P_1$, $P_2$ and $P_3$ respectively. Bottom: Virtual subspaces generated from training data, the two rows indicating the second and fourth eigenvectors of the subspaces.

Fig. 4. Illustration of pose reconstruction on Geodesic flow curve for two subjects. a: unknown pose; b: synthesized from interpolated subspace; c: actual image at interpolated pose.

The enhanced feature set $\{f_1, f_2, \ldots, f_H\}$ for each image is obtained by projecting the feature vector $f$ onto all the intermediate subspaces after generating them. Here $H$ is the total number of subspaces, out of which $Z$ subspaces are computed from the actual training data and $(H - Z)$ are the intermediate virtual subspaces. Fig. 3 (bottom) shows virtual subspaces generated from training data from three parts of the pose space (Fig. 3 (top)), the two rows indicating the second and fourth eigenvectors of the subspaces.

Fig. 5. Proposed feature set representation generated using projections on all intermediate virtual subspaces vs. the SIFT descriptor. Each point in the curves indicates the difference of the feature vector for that pose from that computed from the frontal image.

We illustrate the effectiveness of the proposed feature set representation over the standard feature vector representation for matching across poses by performing an experiment on the Multi-PIE dataset [3]. We use images of 100 subjects, under frontal illumination and five different poses including the frontal pose (Fig. 5). In this experiment, we represent the corner of the right eye using the SIFT descriptor [42] and also using the proposed feature set representation (generated using the original SIFT descriptor). Each point in Fig. 5 shows the difference between the descriptor at that pose and that of the frontal image, averaged over all the 100 subjects. For our descriptor, the difference between two feature sets is computed as the minimum of the distances between all pairs of features. We observe an increase in the difference with increase in the pose difference. For the proposed feature set, this difference is much smaller than for the baseline SIFT descriptor. This indicates that the proposed feature set is more robust to change in pose. But we need to address two issues before using such a feature set for matching across pose variations:

1. The feature sets may not be discriminative enough for recognition or classification tasks because they are computed from generative subspaces.

2. It will also be computationally expensive if two feature sets are matched using some measure like minimum distance, which requires H × H comparisons to compute the distance between two feature sets.

Both the issues are addressed in the following subsections.
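To make the cost of set-to-set matching concrete, the following sketch builds a feature set by projecting one feature vector onto H subspaces and compares two such sets with the minimum pairwise distance used in the Fig. 5 experiment. The function names are hypothetical. Whether the paper keeps the projection coefficients or the back-projected vector is not stated in this section; the sketch back-projects so that every feature stays D-dimensional, which is an assumption.

```python
import numpy as np

def feature_set(f, subspaces):
    """Project a D-dim feature f onto each D x d orthonormal basis.

    Returns an H x D array with one back-projected feature per subspace
    (back-projection is an assumption of this sketch).
    """
    return np.stack([P @ (P.T @ f) for P in subspaces])

def min_set_distance(A, B):
    """Minimum Euclidean distance over all H x H pairs of two feature sets."""
    diffs = A[:, None, :] - B[None, :, :]          # H x H x D pairwise differences
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()
```

The intermediate `diffs` array has H × H × D entries, which is the quadratic cost per comparison that the second issue above refers to.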



3.2. Discriminative features

Here we describe the computation of a discriminative feature set for a given input image from the feature set computed from the generative subspaces. This is done to make the final descriptors discriminative for applications like face and object recognition. For this purpose, we utilize the class labels of the training data to learn a transformation. The feature sets of images are then transformed in such a way that those from the same class come closer to one another, and those belonging to different classes are moved further apart. In this work, the framework of Mahalanobis distance metric learning is used for making the features discriminative. In general, the squared distance between two features $x_i$, $x_j$ can be defined as

$$d^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j) \tag{4}$$

where $M \succeq 0$ is the positive semi-definite matrix that we want to learn. Learning one metric for our approach may not be sufficient because the difference in pose between the images to be matched may be significant (considering the two extremes of the pose space as in Fig. 3). Therefore, we divide the whole pose space into, say, $T$ regions and propose to learn a metric for each of these regions. For example, if there are 12 subspaces in the entire pose space and we divide the space into 4 regions, then each region will consist of 3 subspaces. We jointly use the feature vectors for the constituent subspaces for each region as input features for the Mahalanobis metric learning. Here, we utilize a formulation similar to the large scale metric learning (LSML) [24] for learning the metrics for each of these $T$ regions.

The approach considers two independent generation processes for match and non-match pairs. For example, for the face recognition application, we consider features from the same subject having variations in pose, or (jointly) in pose and resolution, as the match pairs and those from different subjects as the non-match pairs. We decide whether any given pair of features $x_i$ and $x_j$ of a region belongs to the same class or not using a likelihood ratio test, formulated as

$$\delta(x_i, x_j) = \log\left(\frac{p(x_i, x_j \mid H_0)}{p(x_i, x_j \mid H_1)}\right) \tag{5}$$

where $H_0$ and $H_1$ are the hypotheses that a pair is a non-match and a match respectively. The value of $\delta(x_i, x_j)$ is small when a pair of features belongs to the same class and large when the pair belongs to different classes. Assuming a Gaussian structure for the difference space of features, the probabilities in (5) can be written as

$$p(x_i, x_j \mid H_0) = \frac{1}{\sqrt{2\pi\,|\Sigma_{n_{ij}=0}|}} \exp\left(-\frac{1}{2}\, x_{ij}^T \Sigma_{n_{ij}=0}^{-1} x_{ij}\right) \tag{6}$$

$$p(x_i, x_j \mid H_1) = \frac{1}{\sqrt{2\pi\,|\Sigma_{n_{ij}=1}|}} \exp\left(-\frac{1}{2}\, x_{ij}^T \Sigma_{n_{ij}=1}^{-1} x_{ij}\right) \tag{7}$$

where $x_{ij} = x_i - x_j$ is a vector in the difference space; $n_{ij} = 1$ for a match pair and $n_{ij} = 0$ for a non-match pair. $\Sigma_{n_{ij}=1}$ and $\Sigma_{n_{ij}=0}$ are the corresponding covariance matrices.

Now we can reformulate (5) with the help of (6) and (7) as

$$\delta(x_{ij}) = \log\left(\frac{\dfrac{1}{\sqrt{2\pi\,|\Sigma_{n_{ij}=0}|}} \exp\left(-\dfrac{1}{2}\, x_{ij}^T \Sigma_{n_{ij}=0}^{-1} x_{ij}\right)}{\dfrac{1}{\sqrt{2\pi\,|\Sigma_{n_{ij}=1}|}} \exp\left(-\dfrac{1}{2}\, x_{ij}^T \Sigma_{n_{ij}=1}^{-1} x_{ij}\right)}\right) \tag{8}$$

Dropping the constant terms and the common factor of 1/2, which only shift and scale the score, it can be further simplified as

$$\delta(x_{ij}) = x_{ij}^T \left(\Sigma_{n_{ij}=1}^{-1} - \Sigma_{n_{ij}=0}^{-1}\right) x_{ij} \tag{9}$$

Comparing (4) and (9), the Mahalanobis metric is given by $M = \Sigma_{n_{ij}=1}^{-1} - \Sigma_{n_{ij}=0}^{-1}$.
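As an illustration of Eq. (9), the metric for one pose region can be estimated directly from the covariances of match and non-match difference vectors. The sketch below is a simplified reading of this subsection rather than the authors' code: `learn_metric`, `project_psd` and `delta` are hypothetical names, all pairs within the region are enumerated, and the small ridge `eps` is an added assumption to keep the covariances invertible. Since a difference of two inverses need not be positive semi-definite, `project_psd` clips negative eigenvalues; that step is not described in this section and is included only as an optional safeguard.

```python
import numpy as np

def learn_metric(X, labels, eps=1e-6):
    """Estimate M of Eq. (9) from N x D features X and N class labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(X), k=1)        # every unordered pair (i < j)
    diff = X[i] - X[j]                         # difference vectors x_ij
    match = labels[i] == labels[j]             # n_ij = 1 for match pairs
    D = X.shape[1]

    def cov(d):
        # Zero-mean covariance of difference vectors, with a small ridge (assumption).
        return d.T @ d / len(d) + eps * np.eye(D)

    return np.linalg.inv(cov(diff[match])) - np.linalg.inv(cov(diff[~match]))

def project_psd(M):
    """Clip negative eigenvalues so that M is positive semi-definite (optional step)."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T

def delta(M, xi, xj):
    """Score of Eq. (9): small for likely matches, large for likely non-matches."""
    d = xi - xj
    return float(d @ M @ d)
```

In the setting of this section, one such metric would be learned per region, using the features of the subspaces that make up that region.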

3.3. DPF-SPR computation

After computing the discriminative features, the distance between feature sets of two images can be directly computed using a suitable set comparison metric. But this approach is computationally inefficient, which has motivated the development of DPF-SPR. The details of the computational time are discussed in Section 4.5. Now, we explain the computation of the DPF-SPR descriptor from the discriminative feature set for efficient matching of two feature sets.

The set of descriptors corresponding to each region in pose space can be approximated to lie on a linear subspace. This is because there is a gradual change of the feature vectors for the different virtual intermediate poses that have been generated. Suppose the basis vectors for region $t$ spanning the space of the features are given by $g_{t,1}, g_{t,2}, \ldots, g_{t,N_s} \in \mathbb{R}^{D}$. The $D \times N_s$ matrix $G_t = [g_{t,1}, g_{t,2}, \ldots, g_{t,N_s}]$ represents the subspace for region $t$, where $N_s$ is the number of subspaces in a particular region, given by $N_s = H/T$. The dimension of each feature vector is $D$, which can be different for different applications. Now the subspace to vector representation for each region is computed. We can compute the vector representation by rearranging the elements of the $D \times D$ matrix $L = G_t G_t^T$ using the following operator (considering only the elements of the upper triangular matrix, with the diagonal elements scaled by $1/\sqrt{2}$) [40]

$$\text{DPF-SPR}_t = \left(\frac{l_{11}}{\sqrt{2}},\, l_{12},\, \ldots,\, l_{1D},\, \frac{l_{22}}{\sqrt{2}},\, l_{23},\, \ldots,\, \frac{l_{DD}}{\sqrt{2}}\right)' \tag{10}$$

Finally, we concatenate the vector representations for all the $T$ regions into a single vector denoted by DPF-SPR, which is given as follows

$$\text{DPF-SPR} = [\text{DPF-SPR}_1;\, \text{DPF-SPR}_2;\, \ldots;\, \text{DPF-SPR}_T] \tag{11}$$
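Eqs. (10) and (11) translate almost directly into code. The sketch below uses hypothetical function names and takes the per-region matrices $G_t$ as given. One property worth noting, which follows from the $1/\sqrt{2}$ scaling of the diagonal, is that the squared Euclidean distance between two such vectors equals half the squared Frobenius distance between the corresponding $D \times D$ matrices, so comparing two descriptors is a single vector distance instead of a set comparison.

```python
import numpy as np

def subspace_to_point(G):
    """Eq. (10): upper triangle of L = G G' (row by row), diagonal scaled by 1/sqrt(2).

    G is the D x Ns matrix of one pose region; the result has D (D + 1) / 2 entries.
    """
    L = G @ G.T
    L[np.diag_indices_from(L)] /= np.sqrt(2.0)
    return L[np.triu_indices(L.shape[0])]      # order: l11, l12, ..., l1D, l22, ...

def dpf_spr(region_matrices):
    """Eq. (11): concatenate the per-region vectors over the T regions."""
    return np.concatenate([subspace_to_point(G) for G in region_matrices])
```

With $T$ regions the descriptor has $T \cdot D(D+1)/2$ entries, so its length grows quadratically with the feature dimension $D$.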

3.4. DPF-LCC computation

As we will see in the experimental section, DPF-SPR performs very well for matching faces and objects across pose. Another advantage is that it can generalize to unseen poses, i.e. poses which are not available during training. But one limitation is that the dimension of the descriptor can become considerably large as the number of intermediate subspaces increases. This increases the computational time during testing (discussed in Section 4.5). For the same reason, it is difficult to use this descriptor with low-level features which are themselves high dimensional (e.g. AlexNet features, which are of dimension 4096), as discussed in Section 4.5. To overcome this limitation, we propose another novel discriminative pose-free descriptor, termed the Layered Canonical Correlated (DPF-LCC) descriptor, based on canonical correlation analysis (CCA) [43].

Motivation: CCA has proved to be very effective for cross-domain or cross-modal data. CCA learns a common subspace in which the projected features from the source and target domains are maximally correlated. We can think of face images from two different poses as two different domains and apply CCA to match face images across poses. To evaluate its performance for face recognition across pose, we took images of frontal pose and of pose 04_1 (refer to Fig. 8) as source and target views respectively for both training and testing, with no overlap between the training and testing subjects. We see from Fig. 6 (first blue bar) that CCA performs very well for this application. A similar observation can be made if pose 05_0 is used instead of pose 04_1 (second blue bar). To evaluate how CCA performs for unseen poses on which it has not been trained, we perform another experiment where pose 04_1 is used for training and pose 05_0 for testing. We observe from Fig. 6 (third blue bar) that the performance decreases drastically. We can make the following observations from these experiments: 1) CCA performs better when the difference between source and target poses is smaller, since performance is better for pose 05_0 as compared to pose 04_1; and 2) CCA performs well when trained and tested on the same pose, but does not generalize well to unseen poses. Based on these observations, we propose the novel descriptor termed DPF-LCC.

Fig. 6. Motivation for deriving the proposed DPF-LCC descriptor.

Fig. 7. Detailed look of the layered structure for DPF-LCC computation.

Descriptor Computation: As in the case of DPF-SPR , the in-

put features are first projected onto all the intermediate subspaces

which are constructed by sampling on the geodesic curve between

the source and target subspaces. The resultant features are then

transformed by the discriminant metric. Hence we have a fea-

ture set corresponding to each intermediate subspace. Now, for

any two successive feature sets, we compute the projection vec-

tors for those sets using CCA, which are used to project them to

a subspace where they are maximally correlated as illustrated in

Fig. 2 (Top right). This ensures that the difference in pose between

the two subspaces used for computing CCA is small. This also en-

sures that irrespective of the actual pose of the input image, CCA

projection matrix is always applied on the pose it has been trained

on. Fig. 6 also shows the performance using the proposed descrip-

tor. We observe that for known pose (first and second red bar),

its performance is comparable to that of CCA. But for unseen pose

(third red bar), it significantly outperforms CCA.

We now describe the computation of the descriptor using a toy example illustrated in Fig. 7. Let the total number of discriminative subspaces obtained after Section 3.2 be denoted by $H$ (Layer 0). In our example, $H = 4$. Let $Y^1 = [y^1_1, y^1_2, \ldots, y^1_N] \in \mathbb{R}^{D \times N}$ and $Y^2 = [y^2_1, y^2_2, \ldots, y^2_N] \in \mathbb{R}^{D \times N}$ be two feature sets corresponding to two successive subspaces, $s^0_1$ and $s^0_2$ in Layer 0, where $y^i_j \in \mathbb{R}^D$ is a transformed feature, $N$ is the total number of features (each feature comes from one training example) in that particular subspace, and $i = 1, 2, \ldots, H$ for this layer. Now, CCA is performed for each pair of neighbouring subspaces, for example $s^0_i$ and $s^0_{i+1}$. Considering $s^0_1$ and $s^0_2$, the goal is to find two projection vectors $q^1 \in \mathbb{R}^D$ and $q^2 \in \mathbb{R}^D$ such that the correlation coefficient $\rho \in [0, 1]$ is maximized. It is given by

$$\rho = \max_{q^1, q^2} \frac{(q^1)' \Sigma_{12} q^2}{\sqrt{(q^1)' \Sigma_{11} q^1 \, (q^2)' \Sigma_{22} q^2}} \tag{12}$$

where the within class data covariance matrices are given as $\Sigma_{11} = E[y^1 (y^1)']$ and $\Sigma_{22} = E[y^2 (y^2)']$ and the between class data covariance matrix is given as $\Sigma_{12} = E[y^1 (y^2)']$. The projection vector $q^1$ in (12) can be solved by a generalized eigenvalue decomposition problem [43] as given below

$$\Sigma_{12} (\Sigma_{22})^{-1} \Sigma'_{12} q^1 = \alpha \Sigma_{11} q^1 \tag{13}$$

where $\alpha$ is a Lagrangian multiplier. Once $q^1$ is computed, $q^2$ can be obtained using the following equation

$$q^2 = \frac{(\Sigma_{22})^{-1} \Sigma_{12} q^1}{\alpha} \tag{14}$$

In practice, to avoid over-fitting and singularity problems, regularization terms $\alpha_1$ and $\alpha_2$ are added with the covariance matrices $\Sigma_{11}$ and $\Sigma_{22}$ respectively. Therefore we actually solve the following generalized eigenvalue problems instead of (13) and (14) to get $q^1$ and $q^2$ respectively.

$$\Sigma_{12} (\Sigma_{22} + \alpha_2 I)^{-1} \Sigma'_{12} q^1 = \alpha (\Sigma_{11} + \alpha_1 I) q^1 \tag{15}$$

$$q^2 = \frac{(\Sigma_{22} + \alpha_2 I)^{-1} \Sigma_{12} q^1}{\alpha} \tag{16}$$

In this manner, the two sets of projection vectors $\{q^1_k\}_{k=1}^{n}$ and $\{q^2_k\}_{k=1}^{n}$ are computed, where $n < N$. We decide the number of projection vectors $n$ depending on the corresponding correlation coefficient $\rho$. For example, for the SCface dataset [4], we have chosen those pairs of projection vectors for which $\rho > 10^{-5}$ and found that 46 pairs of projection vectors satisfy this criterion, i.e. $n = 46$. Similarly, we find the projection vectors for other pairs of subspaces, $\{s^0_2, s^0_3\}$ and $\{s^0_3, s^0_4\}$ of Layer 0. We take the same number of projection vectors for all the subspaces of a particular layer.
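The regularized problem in (15) and (16) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it uses a whitening-plus-SVD route that yields the same canonical directions as the generalized eigenvalue formulation (up to scale), it centres the data before estimating covariances, and it recovers the second vector through the transposed cross-covariance, which is the standard CCA form. The function name `regularized_cca` and the parameters `reg1`, `reg2` and `rho_min` are assumptions introduced here.

```python
import numpy as np


def regularized_cca(Y1, Y2, reg1=1e-3, reg2=1e-3, rho_min=1e-5):
    """Regularized CCA between two paired feature sets (columns are samples).

    Y1, Y2 : arrays of shape (D, N). Returns (Q1, Q2, rho); the k-th columns
    of Q1 and Q2 form the k-th pair of projection vectors.
    """
    N = Y1.shape[1]
    # Assumption: centre each set so the second moments are covariances.
    Y1 = Y1 - Y1.mean(axis=1, keepdims=True)
    Y2 = Y2 - Y2.mean(axis=1, keepdims=True)

    S11 = Y1 @ Y1.T / N + reg1 * np.eye(Y1.shape[0])  # Sigma_11 + alpha_1 I
    S22 = Y2 @ Y2.T / N + reg2 * np.eye(Y2.shape[0])  # Sigma_22 + alpha_2 I
    S12 = Y1 @ Y2.T / N                               # Sigma_12

    # Whiten both sets with Cholesky factors: S11 = L1 L1', S22 = L2 L2'.
    L1 = np.linalg.cholesky(S11)
    L2 = np.linalg.cholesky(S22)
    # T = L1^{-1} S12 L2^{-T}; its singular values are the correlations rho.
    T = np.linalg.solve(L2, np.linalg.solve(L1, S12).T).T
    U, rho, Vt = np.linalg.svd(T)

    keep = rho > rho_min                        # thresholding rho fixes n
    Q1 = np.linalg.solve(L1.T, U[:, keep])      # q^1_k = L1^{-T} u_k
    Q2 = np.linalg.solve(L2.T, Vt.T[:, keep])   # q^2_k = L2^{-T} v_k
    return Q1, Q2, rho[keep]
```

The number of retained columns plays the role of n in the text: it is whatever survives the threshold on ρ, so it depends on the data rather than being fixed in advance.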


Table 1
Rank-1 recognition accuracies (%) for face recognition across pose variations on the PIE dataset [2].

| Method | c11 | c29 | c05 | c37 | Average |
|---|---|---|---|---|---|
| K-SVD [45] | 48.5 | 76.5 | 80.9 | 57.4 | 65.8 |
| Eigen Light-field [46] | 78.0 | 91.0 | 93.0 | 89.0 | 87.8 |
| SGF [26] | 58.8 | 89.7 | 89.7 | 72.1 | 77.6 |
| GFK [47] | 63.2 | 92.7 | 92.7 | 76.5 | 81.3 |
| Subspace Interp. via DL [27] | 76.5 | 98.5 | 98.5 | 88.2 | 90.4 |
| Proposed Approach (DPF-SPR) | 98.5 | 100 | 100 | 98.5 | 99.3 |
| Proposed Approach (DPF-LCC) | 98.5 | 100 | 100 | 100 | 99.6 |

The total number of subspaces decreases in each layer, starting from $H$ in Layer 0, then $(H - 1)$ in Layer 1, and so on, and finally one subspace in Layer $(H - 1)$. For each layer, for every pair of consecutive subspaces, two projection matrices are computed. So if there are $h$ subspaces in one layer, there will be $(h - 1)$ pairs of subspaces leading to $2(h - 1)$ projection matrices. Thus, in the example given in Fig. 7, the total number of projection matrices will be 6, 4 and 2 for Layer 0, 1 and 2 respectively. The number of features will also change in each layer. Each subspace in Layer 0 has $N$ features. Since subspaces $s^0_1$ and $s^0_2$ of Layer 0 have $N$ features each, when they are projected onto the subspace $s^1_1$ of Layer 1, there will be a total of $2N$ features in that subspace. Likewise, a subspace corresponding to Layer $m$ will have $2^m N$ features. The initial feature dimension $D$ will also change in each layer, which will essentially depend on the number of projection vectors used for each projection matrix in each layer. The number of projection vectors corresponding to different layers can be different. But we have observed that if the range of $\rho$ is sufficiently large (as we have considered for the SCface experiment), the value of $n$ remains almost the same for all the subsequent layers. The output of the training will be all the projection matrices, denoted by $Q^i_j$, which is the $j$th projection matrix of the $i$th layer.
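The bookkeeping in this paragraph is easy to check with a short script. The sketch below only counts subspaces, projection matrices and features per layer from the rules stated above; `layer_summary` and its dictionary keys are names made up for this illustration.

```python
def layer_summary(H, N):
    """Per-layer counts for a layered structure that starts from H subspaces,
    each holding N training features in Layer 0."""
    rows = []
    for m in range(H):
        subspaces = H - m                  # one fewer subspace per layer
        matrices = 2 * (subspaces - 1)     # two per pair of neighbours
        features = (2 ** m) * N            # features pooled from both parents
        rows.append({
            "layer": m,
            "subspaces": subspaces,
            "projection_matrices": matrices,
            "features_per_subspace": features,
        })
    return rows


if __name__ == "__main__":
    for row in layer_summary(H=4, N=100):
        print(row)
```

For H = 4 this reproduces the counts quoted for Fig. 7: 6, 4 and 2 projection matrices for Layers 0, 1 and 2, and none for the final layer, which contains a single subspace.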

During testing, the proposed descriptor is computed for all the images in the gallery and probe, which are then compared. First the extracted low-level features (e.g. SIFT for faces) are projected onto all the subspaces learned on the geodesic curve and transformed using the learned $M$. This is the feature corresponding to Layer 0 of the proposed approach. These features are projected onto Layer 1 and all the subsequent layers using the learned projection matrices. Finally we concatenate all the projected features in the final layer to generate the DPF-LCC descriptor. For our example, suppose the feature dimension of the image in Layer 0 is $D$. If we take $n = 46$ for all the layers, after projection to Layer 1, there will be two 46-dimensional vectors corresponding to each subspace. Moving ahead, after projection in Layer 2, there will be four 46-dimensional vectors corresponding to each subspace. Finally, after projection in Layer 3, there will be a total of eight vectors of dimension 46 representing the image. These eight vectors are concatenated to form the DPF-LCC descriptor of dimension $8 \times 46 = 368$. So the dimension of the DPF-LCC descriptor does not depend on the initial dimension of the low-level features extracted from the image, which is another advantage.
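A minimal sketch of the test-time projection described above, assuming each pair of neighbouring subspaces in a layer contributes two projection matrices and that the projected vectors of both subspaces are pooled into one subspace of the next layer. The function names, the row-wise storage of the transposed matrices and the random matrices are illustrative only; they stand in for matrices learned by CCA.

```python
import random


def project(Qt, y):
    """Apply Q' to y, with Q' stored as a list of rows."""
    return [sum(q * v for q, v in zip(row, y)) for row in Qt]


def dpf_lcc_descriptor(layer0_features, projections):
    """layer0_features: list of H vectors, one per Layer 0 subspace.
    projections[m][j] = (Qt_left, Qt_right) for subspaces j and j+1 of layer m.
    """
    current = [[f] for f in layer0_features]   # vectors held by each subspace
    for layer in projections:
        nxt = []
        for j, (Qt_left, Qt_right) in enumerate(layer):
            merged = [project(Qt_left, y) for y in current[j]]
            merged += [project(Qt_right, y) for y in current[j + 1]]
            nxt.append(merged)
        current = nxt
    # A single subspace remains; concatenate its vectors.
    return [value for vec in current[0] for value in vec]


def random_projections(H, D, n, seed=0):
    """Random stand-ins for the learned matrices (hypothetical helper)."""
    rng = random.Random(seed)
    projections, dim = [], D
    for m in range(H - 1):
        layer = []
        for _ in range(H - m - 1):             # pairs of neighbours in layer m
            pair = tuple(
                [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
                for _ in range(2)
            )
            layer.append(pair)
        projections.append(layer)
        dim = n                                # later layers work in n dims
    return projections


if __name__ == "__main__":
    H, D, n = 4, 128, 46
    rng = random.Random(1)
    feats = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(H)]
    print(len(dpf_lcc_descriptor(feats, random_projections(H, D, n))))
```

Changing `D` leaves the output length unchanged, which mirrors the remark that the descriptor size is independent of the low-level feature dimension.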

4. Experimental evaluation

In this section, we present the results of extensive experiments conducted to evaluate the effectiveness and usefulness of the proposed descriptors. In particular, experiments on face recognition across pose, face recognition across pose and resolution, and object recognition across pose are done to test the applicability of the proposed approach for these applications.

4.1. Face recognition across pose

Face images are represented by local feature descriptors (SIFT [42] in this paper) computed at 15 fiducial locations. We have used a freely available C++ software library based on active shape models known as STASM [44] to detect the fiducial locations automatically, and also verified them manually and corrected the incorrect ones. Here, experiments on recognizing faces across pose variations on the CMU-PIE dataset [2] are presented. We follow the same protocol as in [27] and have used all the 68 subjects under 5 different poses and frontal illumination. 100 subjects from the Multi-PIE data [3], whose images have been captured under similar conditions, are used for training. Subspaces are constructed for each fiducial point separately and then concatenated to form the representation of the facial image. Furthermore, frontal and extreme poses (c11 and c37) are used for representing the entire pose space during training. We compute 12 subspaces in between pose c11 to frontal and frontal to pose c37. We also subdivide the entire pose space into 4 regions for computing the discriminative feature sets.

We consider the frontal images as the gallery and the non-frontal images under different poses as the probe. As there is no overlap between the subjects used in the training and testing set, there is no need for retraining even if the test subjects change. After the training stage, the subspaces and transformation matrices can be used for any test subject. During testing, initially both the gallery and the probe images are projected onto all the subspaces, and then their discriminative feature sets are computed using the learned metric. Finally, either the DPF-SPR or the DPF-LCC descriptor is computed. Unlike most of the methods in literature, there is no need to learn a classifier separately for each pose, which is a major advantage of our approach.
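The gallery/probe protocol above reduces to nearest-neighbour matching once descriptors are available. The sketch below computes a Rank-1 accuracy under the assumption that descriptors are compared with Euclidean distance; the distance choice and the function name `rank1_accuracy` are illustrative and not taken from the paper.

```python
import math


def rank1_accuracy(gallery, gallery_ids, probes, probe_ids):
    """Percentage of probes whose nearest gallery descriptor has the same identity."""
    correct = 0
    for probe, true_id in zip(probes, probe_ids):
        distances = [math.dist(probe, g) for g in gallery]
        best = min(range(len(gallery)), key=distances.__getitem__)
        correct += gallery_ids[best] == true_id
    return 100.0 * correct / len(probes)


if __name__ == "__main__":
    gallery = [[0.0, 0.0], [10.0, 10.0]]       # one descriptor per subject
    gallery_ids = ["a", "b"]
    probes = [[1.0, 0.0], [9.0, 9.0], [6.0, 6.0]]
    probe_ids = ["a", "b", "a"]
    print(rank1_accuracy(gallery, gallery_ids, probes, probe_ids))
```

Because the same descriptor is computed for every image, a single matcher of this kind serves all probe poses.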

Table 1 shows the results of the proposed approach for this experiment. We compare with several other approaches, namely (1) K-SVD [45], which learns a dictionary from the frontal images and uses the same dictionary to get the sparse coefficients for the non-frontal images; (2) SGF [26] and GFK [47], which perform subspace interpolation on the Grassmann manifold; (3) the Eigen light-field approach [46], which is designed specifically to recognize faces across pose; (4) subspace interpolation via dictionary learning [27], which interpolates subspaces by using dictionary learning to link the frontal and non-frontal domains. The recognition accuracies of all the other approaches are taken directly from [27]. Even with no separate training for each of the different probe poses, the proposed approaches perform better than all the other methods for the task of recognizing faces across pose variations.

4.2. Face recognition across pose and resolution

Here, we test the applicability of the proposed descriptors for recognizing faces across multiple variations simultaneously. For this, we perform face recognition with frontal, high-resolution (HR) images as gallery and non-frontal, low-resolution (LR) images as probe, as usually found in surveillance scenarios.

Results on MultiPIE dataset: First we report results on the Multi-PIE dataset [3] containing images of 337 subjects from four different recording sessions captured under different poses, illumination conditions and expressions. We consider HR images under frontal pose and frontal illumination condition as gallery for our experiments. For the probe images, we use LR images taken under pose 04_1, 05_0, 13_0 and 14_0 (as named in the dataset) under all the 20 different illumination conditions and neutral expression. Fig. 8(a) shows a few sample HR gallery images and (b,c,d,e) shows a few probe images in four different poses. We use HR images of size 60 × 50 and LR images of size 20 × 17 (i.e. scale factor of 3) for all the experiments. Standard bi-cubic interpolation was applied on the HR images to get the LR images. 100 randomly chosen subjects with frontal, 13_0 (left extreme) and 04_1 (right extreme)


Fig. 8. Example images from the Multi-PIE data [3]. (a) Frontal high-resolution images used as gallery; (b,c,d,e) low-resolution images under non-frontal pose (pose 13_0, 14_0, 05_0 and 04_1 as given in the dataset) used as probe images.

Table 2
Rank-1 recognition performance (%) for four different probe poses, averaged over the different gallery illuminations on the Multi-PIE dataset [3].

| Method | Pose 13_0 | Pose 14_0 | Pose 05_0 | Pose 04_1 |
|---|---|---|---|---|
| MDS Learning [48] | 32.8 | 44.8 | 47.0 | 48.5 |
| LSML [24] | 46.9 | 53.9 | 55.2 | 54.3 |
| GMA [49] | 65.0 | 70.1 | 70.3 | 64.2 |
| MvDA [50] | 45.7 | 55.0 | 53.8 | 42.9 |
| FCPRF + LSML [51] | 54.0 | 71.2 | 73.4 | 61.0 |
| SCDL [52] | 66.3 | 73.0 | 72.7 | 64.1 |
| CFDL [53] | 65.9 | 72.0 | 72.8 | 64.7 |
| SCDL + LSML | 69.1 | 75.1 | 74 | 67.6 |
| CFDL + LSML | 68.9 | 74.1 | 74.6 | 68.1 |
| Proposed DPF-SPR | 74.5 | 78.0 | 74.0 | 70.1 |
| Proposed DPF-LCC | 75.5 | 78.0 | 78.05 | 74.7 |

Fig. 10. Example facial images of Surveillance Cameras Face Database [4] . Top

row: frontal gallery images, second row: corresponding probe images captured by

surveillance cameras.

poses are used for generating the subspaces and metric learning,

and the remaining subjects for testing. There is no overlap between

the train and test subjects and training is done only once for all

the poses. We use HR images of frontal pose and LR images of ex-

treme poses during the subspace generation. Sample images that

are used for training (marked with bounding box) and testing (all

the five poses) are shown in Fig. 9 . The parameters for the pro-

posed approach are the same as used in the PIE experiment.

The results for the proposed descriptors are reported in Table 2. We also compare our approach with several state-of-the-art approaches, namely (1) MDS transformation learning [48], where a transformation between the HR frontal gallery and LR non-frontal probe is learned; (2) metric learning approaches: large scale metric learning (LSML) [24], where a metric from equivalence constraints based on the statistical inference perspective is learned; (3) semi-coupled and coupled dictionary learning [52,53], where joint dictionary learning is performed to match objects from different domains; (4) generalized multiview analysis (GMA) [49], where a joint, quadratic program over different feature spaces is solved to compute a single linear subspace; (5) multiview discriminant analysis (MvDA) [50], where a single discriminant common space for multiple views is pursued in a non-pairwise manner by jointly learning multiple view-specific linear transforms; (6) face image classification by pooling raw features (FCPRF) [51], where features are extracted by pooling local patches over a multi-dimensional pyramid.

Fig. 9. Illustration of training poses (marked with bounding box) and testing poses (all the five poses) that are used in our experiments on Multi-PIE database. Poses 14_0 and 05_0 are not used for training.

Note that we have not used the two intermediate poses (14_0 and 05_0) during training, but still we achieve good performance for probe images in these intermediate poses. In comparison, results for all the other approaches reported in Table 2 are obtained using all the poses for training. Their performance is lower when we train with only the frontal and extreme poses, as used in the proposed approach. We have provided the same input features for all the algorithms (except [51], where the algorithm itself extracts the robust features) and learned one transformation for all the probe poses. We have taken the source codes for the other approaches from the respective authors' websites. For fair comparison, we also report results of the dictionary learning approaches (SCDL and CFDL) with LSML applied on the sparse coefficients. We have also applied LSML on FCPRF features, which can add discriminability and thereby help improve the performance of raw features. We observe that for all the poses, the proposed descriptors perform better than the state-of-the-art approaches. We also observe that DPF-LCC matches or outperforms DPF-SPR on every probe pose.

Results on Surveillance Cameras Face Database (SCface): We also evaluate the proposed descriptors on real surveillance quality data obtained from the SCface database [4]. It contains images of 130 subjects captured in an uncontrolled environment using five different video surveillance cameras. For the gallery images, a high-quality camera was used. The same experimental setup as used in [48] is applied for our evaluation, which includes all the images from the five surveillance cameras, i.e. a total of 650 images. A few gallery (top row) and probe images (bottom row) are shown in Fig. 10.

As in [48], 50 subjects are randomly picked for training and the remaining 80 subjects are used for testing (thus there are a total of 400 probe images). There is no overlap between the train and test subjects. We have repeated the experiment 10 times with different random sampling of the subjects. The Rank-1 accuracy of the proposed approach and comparisons with several other approaches for this experiment are reported in Table 3. HR frontal images


Table 3
Rank-1 accuracy (%) of the proposed approach and comparison with state-of-the-art approaches on the Surveillance Cameras Face Database [4]. The two columns indicate two different training setups: using data from only one camera and from five cameras for training, respectively. The proposed approach trained using data from just one camera performs better than all the compared approaches even when they are trained using data from all five cameras.

| Method | Rank-1 (1 Cam) | Rank-1 (5 Cam) |
|---|---|---|
| MDS Learning [48] | 30.0 | 61.1 |
| LSML [24] | 64.7 | 67.2 |
| GMA [49] | 38.2 | 50.5 |
| FCPRF + LSML [51] | 58.0 | 61.3 |
| SCDL [52] | 48.2 | 58.5 |
| CFDL [53] | 45.7 | 62.2 |
| SCDL + LSML | 48.8 | 60.0 |
| CFDL + LSML | 46.3 | 63.3 |
| Proposed DPF-SPR | 69.0 | – |
| Proposed DPF-LCC | 72.0 | – |

Fig. 11. Sample images from the COIL 20 dataset [5]. The first column shows the gallery images and the second to fifth columns show some probe images for the same objects.

Table 4
Rank-1 accuracy (%) of the proposed approach and comparison with other approaches on COIL 20 database [5].

| Method | Rank-1 Accuracy |
|---|---|
| MDS Learning [48] | 75.6 |
| LSML [24] | 80.3 |
| GMA [49] | 66.1 |
| SCDL [52] | 79.2 |
| CFDL [53] | 78.7 |
| SCDL + LSML | 82.6 |
| CFDL + LSML | 82.0 |
| MvDA [50] | 69.7 |
| Proposed DPF-SPR | 82.2 |
| Proposed DPF-LCC | 83.0 |

Fig. 12. Sample RGB (row 1 and 3) and the corresponding depth images (row 2 and

4) of calculator and keyboard objects from RGB-D object database [6] .

and LR non-frontal images from one camera are used for our approaches to generate the subspaces and for transformation learning. For all the other approaches, two setups are followed for training: (a) HR frontal images and non-frontal images from one camera (same as for the proposed approach) (Table 3 second column); (b) HR frontal images and non-frontal images from all the five cameras (Table 3 third column). When only one camera is used for training, the performance of the proposed approaches is significantly better than that of the other approaches. Even though the performance of the other approaches improves by using images from all the cameras, they still perform worse than the proposed approaches. This shows that our descriptors can generalize better across unseen poses. We perform another experiment, in which we use GMA [49] in place of CCA keeping the same experimental protocol, and observe that the Rank-1 recognition rate improved to 74.5%. Thus, if CCA can be replaced by better approaches, the performance of the proposed DPF-LCC descriptor may improve further.

4.3. Object recognition across pose

Now we illustrate the applicability of the proposed descriptors to recognize general objects across variations in viewpoint. For this, we perform experiments on the Columbia object image library 20 (COIL 20) database [5]. It contains 20 objects with gray-scale images. A few sample images are shown in Fig. 11. The dataset is created in such a way that each object is captured by rotating it about its vertical axis at a regular interval of 5°. 50 images of each object, with pose variations from left extreme to right extreme including the frontal pose, are selected for the experiments. For training we use five images per object around the frontal pose and extreme poses, and the remaining images are used for testing. We have resized the images to 32 × 32 and used the image intensity values as the input features. The images are normalized against their maximum pixel value. For our experiment, a total of 12 subspaces with four regions in pose space are used.
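The input features for this experiment are simple enough to sketch directly. The snippet below assumes the image has already been resized to 32 × 32 and is given as a nested list of intensities; it divides by the maximum pixel value and flattens the result. The function name `intensity_feature` and the handling of an all-zero image are assumptions made for illustration.

```python
def intensity_feature(image):
    """Flatten a 2D intensity image into a vector normalized by its maximum pixel."""
    peak = max(max(row) for row in image)
    if peak == 0:
        # Assumption: an all-black image maps to an all-zero feature.
        return [0.0 for row in image for _ in row]
    return [pixel / peak for row in image for pixel in row]


if __name__ == "__main__":
    # Synthetic 32 x 32 image with intensities in 0..255.
    image = [[(r * 32 + c) % 256 for c in range(32)] for r in range(32)]
    feature = intensity_feature(image)
    print(len(feature), max(feature))
```

A 32 × 32 image therefore gives a 1024-dimensional low-level feature whose largest entry is 1.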

We consider images of frontal pose as gallery data and the remaining images that differ in pose as probe data during testing. The images and the object poses used for testing are different from those used during training. Note that our experimental protocol is different from the ones normally used, and so the performance cannot be directly compared with other published papers which have reported results on this dataset. The Rank-1 recognition accuracy of the proposed descriptors and comparisons with other approaches are given in Table 4. We observe that the proposed descriptors perform favourably as compared to the other approaches.

4.4. Object recognition on RGB-D object database

The RGB-D database [6] contains both RGB and depth images from 51 categories. The objects are captured in such a way that they are covered from multiple views. We take all the images (both visual and depth) of the first instance in each category for our experiments. In each category, we selected five images from four different poses for training and the rest of the images are used as probe images during testing. Sample images from the database are shown in Fig. 12.

Kernel descriptors [54] of dimension 500 are extracted separately from visual and depth images and are used as features in this experiment. The recognition experiment is conducted to recognize visual probe images against visual gallery images (Visual - Visual) and depth probe images against depth gallery images (Depth - Depth), and the Rank-1 recognition rates (%) are reported in Table 5. Comparison with other algorithms is also shown for both the cases. We observe that the proposed DPF-SPR and DPF-LCC descriptors perform favourably as compared to the other approaches, thus justifying their usefulness for the application of general object recognition.

Table 5
Rank-1 accuracy (%) of the proposed approach and comparison with other approaches on RGB-D object database [6].

| Method | Visual - Visual | Depth - Depth |
|---|---|---|
| MDS Learning [48] | 82.2 | 53.9 |
| LSML [24] | 60.1 | 45.8 |
| GMA [49] | 70.6 | 38.9 |
| MvDA [50] | 77.2 | 50.6 |
| SCDL [52] | 80.4 | 61.1 |
| CFDL [53] | 81.0 | 60.5 |
| SCDL + LSML | 81.7 | 62.0 |
| CFDL + LSML | 82.0 | 61.3 |
| Proposed DPF-SPR | 86.0 | 62.0 |
| Proposed DPF-LCC | 84.8 | 63.1 |

Fig. 13. Rank-1 accuracy (%) vs number of subspaces of the Surveillance Cameras Face Database.

Table 6
Number of metric learning regions vs Rank-1 accuracy (%).

| Method | 1 region | 2 regions | 3 regions | 4 regions | 5 regions | 6 regions |
|---|---|---|---|---|---|---|
| DPF-SPR | 65.0 | 65.3 | 69.0 | 64.5 | 64.5 | 64.3 |
| DPF-LCC | 70.8 | 72.0 | 72.0 | 69.8 | 69.8 | 70.0 |

Fig. 14. Size of the descriptors vs number of subspaces.

Fig. 15. Time required (seconds) vs number of subspaces.

4.5. Analysis of DPF-LCC and DPF-SPR descriptors

Here we analyze the proposed descriptors, DPF-LCC and DPF-SPR, in more detail. For this purpose, we use the SCface database [4]. The experimental setup is similar to that of [48], where we randomly pick 50 subjects for training and use the remaining 80 subjects for testing with no overlap between the train and test subjects.
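The subject-disjoint protocol used for SCface (50 training and 80 test subjects out of 130, repeated with different random draws) can be sketched as follows. The helper names and the use of the repetition index as random seed are assumptions for illustration; the paper does not specify how the random draws were generated.

```python
import random


def split_subjects(subject_ids, n_train, seed):
    """Randomly pick n_train subjects for training; the rest are test subjects."""
    rng = random.Random(seed)
    train = set(rng.sample(sorted(subject_ids), n_train))
    test = [s for s in sorted(subject_ids) if s not in train]
    return sorted(train), test


def repeated_splits(subject_ids, n_train=50, repeats=10):
    """One subject-disjoint split per repetition (seed = repetition index)."""
    return [split_subjects(subject_ids, n_train, seed) for seed in range(repeats)]


if __name__ == "__main__":
    subjects = list(range(130))               # SCface has 130 subjects
    for train, test in repeated_splits(subjects):
        print(len(train), len(test), len(test) * 5)   # 5 probe cameras
```

With five surveillance cameras per subject, each split yields 80 × 5 = 400 probe images, and reported accuracies would be averaged over the repetitions.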

First, we analyze the effect of the number of subspaces on the

Rank-1 accuracy. This result is shown in Fig. 13 . We observe that

for a wide range of the number of subspaces, the performance of

the two descriptors does not vary widely.

Now, we analyze the effect of the number of metric learning regions used for learning the discriminative features on the recognition accuracy. We use six intermediate subspaces in our experiments. Table 6 shows the Rank-1 accuracy (%) with different numbers of metric learning regions. We see that the performance varies little with small changes in the number of regions. We also observe that when the number of regions is very small, the performance of both the proposed descriptors is slightly lower. This can potentially be attributed to the larger difference in pose among the images within each region. Also, if we increase the number of regions, the accuracy increases at first and then starts decreasing. This is because, when the number of regions is high, the number of match and non-match pairs available for learning the discriminative metric is relatively small.

Now, we analyse the feature dimension and the computational requirements of the two proposed descriptors. The dimensions of the two descriptors, DPF-LCC and DPF-SPR, are functions of the number of intermediate subspaces used. Fig. 14 shows the variation in the feature dimension of DPF-SPR and DPF-LCC with different numbers of intermediate subspaces for the SCface database. For this dataset, we have taken the number of subspaces as six, and thus the feature dimensions for the DPF-SPR and DPF-LCC descriptors are 371520 and 22080 respectively. Here, we represent each facial image as the concatenation of features computed from each of the 15 fiducial points. We observe that the dimension of both the descriptors increases with the number of subspaces. But the feature dimension of DPF-LCC is much less as compared to that of the DPF-SPR descriptor for the entire range.
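The DPF-LCC figure quoted above is consistent with the layered construction of Section 3.4, under the assumption that each fiducial point contributes $2^{H-1}$ vectors of dimension $n$ from the final layer. The symbol $P$ for the number of fiducial points is introduced here only for this check:

```latex
\dim(\text{DPF-LCC}) = P \cdot 2^{H-1} \cdot n
                     = 15 \times 2^{5} \times 46
                     = 15 \times 1472
                     = 22080
\qquad (P = 15,\; H = 6,\; n = 46)
```

The same pattern gives $2^{3} \times 46 = 368$ per feature for the toy example with $H = 4$.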

The time required to compute the distance between two descriptors is a function of their dimension. We have already observed that the dimension of DPF-LCC is considerably less than that of DPF-SPR. So it can be expected that it will take less time to compute the distance between two DPF-LCC descriptors than between two DPF-SPR descriptors. Fig. 15 shows the plot of time required for pairwise comparison (in seconds) against the number of subspaces. Since the dimension of both the descriptors increases with the number of subspaces, the time requirement also increases. But we observe that the time required for DPF-LCC is much less than that required for DPF-SPR. For the SCface database, there are 80 gallery images during testing, so the time required to get the identity of one probe image is around 0.7 s for DPF-LCC, in comparison to around 15 s for DPF-SPR. This difference will increase as the size of the gallery increases. We have mentioned


Fig. 16. Rank-1 accuracy as a function of number of projection vectors in each pro-

jection matrix.

Table 7
Rank-1 recognition accuracy (%) for four different probe poses, averaged over the different gallery illuminations on the Multi-PIE dataset [3] using VGG Features [55].

| Method | Pose 13_0 | Pose 14_0 | Pose 05_0 | Pose 04_1 |
|---|---|---|---|---|
| VGG-HR-LR-NN | 32.2 | 52.8 | 53.1 | 32.8 |
| VGG-HR-LR-Proposed | 39.7 | 55.9 | 57.0 | 40.5 |
| VGG-HR-HR-NN | 88.3 | 97.0 | 97.0 | 91.3 |
| VGG-HR-HR-Proposed | 92.6 | 98.0 | 98.2 | 94.3 |

Table 8
Rank-1 accuracy (%) of the proposed approach on RGB-D database with AlexNet deep features [56].

| Method | Rank-1 accuracy |
|---|---|
| AlexNet-NN | 90.2 |
| AlexNet-Proposed | 91.4 |

earlier that after the computation of the discriminative descriptor, the feature sets can be compared using pairwise comparisons. But this approach is computationally expensive, and this has motivated the development of the two proposed descriptors. So we also plot the time required to compute the distance between two images by pairwise comparison in Fig. 15. We see that both the proposed descriptors require much less time as compared to the pairwise comparison approach, especially as the number of subspaces increases. The Rank-1 accuracy for this dataset with pairwise comparison is around 68.75%, which is slightly less than that obtained using the two descriptors.
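The per-probe timings quoted for SCface (about 0.7 s for DPF-LCC and 15 s for DPF-SPR against 80 gallery images) can be extrapolated to other gallery sizes if one assumes exhaustive matching whose cost grows linearly with the gallery. That linearity is an assumption of this sketch, not a measurement from the paper, and `per_probe_time` is a name introduced here.

```python
def per_probe_time(reference_seconds, reference_gallery, new_gallery):
    """Scale a measured per-probe time linearly with the gallery size."""
    return reference_seconds / reference_gallery * new_gallery


if __name__ == "__main__":
    for gallery_size in (80, 800, 8000):
        lcc = per_probe_time(0.7, 80, gallery_size)
        spr = per_probe_time(15.0, 80, gallery_size)
        print(gallery_size, round(lcc, 2), round(spr, 2), round(spr - lcc, 2))
```

Under this assumption the ratio between the two descriptors stays fixed while the absolute gap widens with the gallery, which matches the remark that the difference grows as the gallery gets larger.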

Lastly, we analyse the performance of DPF-LCC with different values of n, where n is the number of projection vectors in each projection matrix as described in Section 3.4. As mentioned earlier, n can be different for the different layers, but we have taken the same value since we have observed that it does not vary much with the layers. We observe from Fig. 16 that the Rank-1 accuracy initially increases with increasing n, reaches a maximum and then saturates. The value of n which gives the maximum accuracy is quite small, which is one of the main reasons behind the relatively small dimension of the DPF-LCC descriptor.

4.6. Analysis with state-of-the-art deep features

Recently, deep learning techniques have become very popular in computer vision and have produced state-of-the-art results for many different applications. In this section, we compare the proposed approach with some of the recent deep learning methods and also show how the proposed descriptors can be used to further boost the performance of features obtained from deep neural networks. For this purpose, we perform experiments on the Multi-PIE dataset [3] for faces and the RGB-D database [6] for objects. The setup for these experiments is the same as that reported in the experiments section. For the Multi-PIE dataset [3], we use a recent deep learning architecture, VGGNet [55], which has been specifically trained with faces for the application of face recognition. The output of the fully connected layer labeled as FC6 in the original VGGNet model, whose dimension is 4096 × 1, is used as the feature in this experiment. Note that we use the existing network parameters without any retraining. Using the HR images as gallery and the LR images as probe (i.e. a similar protocol as used in our previous experiments) and a nearest neighbour classifier, the Rank-1 accuracy (%) is reported in Table 7, denoted as VGG-HR-LR-NN. Since the VGGNet is not trained on low-resolution images, the Rank-1 recognition accuracy is quite low, as expected. We then compute the proposed DPF-LCC descriptor using the VGGNet [55] output as the low-level features and the performance is reported as VGG-HR-LR-Proposed. We see that though the performance is still low, it is better than that obtained using the VGGNet features directly. Since the VGGNet is trained on HR images, we also perform another experiment with HR images as both gallery and probe. The results in this case for nearest neighbour matching of the VGGNet features and for the proposed descriptor on the VGGNet features are given in the third and fourth rows of Table 7 respectively. As expected, the performance using VGGNet features directly is very good. We also observe that in this case too, the proposed descriptor is able to further improve the performance, thus justifying its usefulness with different kinds of input features.

For the application of object recognition, we perform similar experiments with the very popular deep learning architecture AlexNet [56], which is pretrained on object images from ImageNet. Here, we take the output of the first fully connected layer of the network as the low-level feature. First, we compute the accuracy when the AlexNet features are used with a nearest neighbour classifier to compute the probe identity. This result is given in Table 8. We also report the results using AlexNet along with the proposed descriptor, termed AlexNet-Proposed. We observe that the proposed descriptor performs slightly better than the low-level features. For all these comparisons, we have used DPF-LCC and not the DPF-SPR descriptor due to the high dimensionality of the latter. Note that both these deep networks have been trained on millions of images, and so improvement over these features using the proposed descriptors justifies the usefulness of the proposed approach.

5. Discussion

In this work, we proposed two novel discriminative pose-free descriptors (DPF-SPR and DPF-LCC) for matching faces and objects across pose. The proposed approaches require images from a few regions of the pose space for training and do not require separate training for each probe pose. Experimental evaluations for various tasks like face recognition across pose, face recognition across resolution and pose, and object recognition across different viewpoints are conducted to evaluate the usefulness and generalizability of the approach. The superior performance of the proposed descriptors as compared to the state-of-the-art approaches shows the effectiveness of the proposed approach.



Soubhik Sanyal received B.E. degree from Jadavpur University in 2013. He is currently working toward the M.Sc (Engg.) degree in Electrical Engineering in Indian Institute of Science, Bangalore, India. He is a student member of IEEE and IEEE Signal Processing Society. His research interests include computer vision, machine learning and deep learning.

Sivaram Prasad Mudunuri received the B.Tech. degree in Electronics and Communications Engineering from Sasi Institute of Technology and Engineering, India, in 2009, and M.Tech. degree from the National Institute of Technology, Calicut, in 2011. He is currently a doctoral student in the Department of Electrical Engineering at the Indian Institute of Science, Bangalore, India. He is a Student member of IEEE. His research interests are in image processing, computer vision and pattern recognition.

Soma Biswas is an assistant professor in the Electrical Engineering Department, Indian Institute of Science, Bangalore, India. She received the MTech degree from the Indian Institute of Technology, Kanpur, in 2004, and PhD degree in Electrical and Computer engineering from University of Maryland, College Park, in 2009. Her research interests include image processing, computer vision, and pattern recognition. She is a senior member of the IEEE.
