
Face Verification and Face Image Synthesis

under Illumination Changes

using Neural Networks

by

Tamar Elazari Volcani

Under the supervision of

Prof. Daphna Weinshall

School of Computer Science and Engineering

The Hebrew University of Jerusalem

Israel

Submitted in partial fulfillment of the

requirements of the degree of

Master of Science

December, 2017


Abstract

Following the success that neural networks have brought to the field of face recognition, we examine two further issues regarding changes in illumination.

We first examine the possibility of training a face verification algorithm, based on neural networks, to overcome illumination changes. The common practice in face verification has been to search for hand-crafted optimal features, such that verification can be performed with a simple computation, like an inner product. In this work we focus on training a neural network that overcomes the need to predefine a feature space.

As an extension of face verification, we explore a more challenging task: the generation of new face images, rather than only the verification of existing ones. We propose several novel algorithms, based on neural networks, trained to generate face images under different illuminations.


Face Verification and Face Image Synthesis under Illumination Changes

using Neural Networks

Tamar Elazari Volcani

Supervisor:

Prof. Daphna Weinshall

The Rachel and Selim Benin School of Computer Science and Engineering

The Hebrew University of Jerusalem

Israel

Submitted in partial fulfillment of the requirements for the degree of Master of Science

Tevet 5778 (December 2017)


Abstract

Following the success that neural networks have brought to the field of face recognition, we examine two further issues concerning illumination changes in images.

We first examine whether it is possible to train a face verification algorithm, based on neural networks, that can overcome illumination changes. The prevailing approach in face verification advocates finding an optimal feature space, on the basis of which decisions can be made with a simple computation, such as an inner product between vectors. In this work we focus on training a decision maker that overcomes the need to find such a feature space.

We then explore a more challenging task: whether new face images can be generated, rather than only verified. We define several neural-network-based algorithms that are trained to synthesize an image of a given person under new illumination conditions.


Contents

1 Introduction
2 Tools
   2.1 The YaleB Dataset
   2.2 Code Packages
      2.2.1 Artificial Neural Network
      2.2.2 SVM
3 Illumination Invariant Face Verification
   3.1 Related Work
   3.2 Task Definition
   3.3 Data Preparations
      3.3.1 Training and Testing Samples Definition
      3.3.2 Balancing the Training Sample Set
   3.4 Face Verification Algorithms
      3.4.1 Feature Vector Construction
      3.4.2 Pair Recognition
   3.5 Results and Discussion
      3.5.1 Evaluation of Feature Space
      3.5.2 Evaluation of Pair Recognition
      3.5.3 Conclusions
4 Synthesis of Face Images Under New Illuminations
   4.1 Related Work
   4.2 Task Definition: New Illumination Synthesis
   4.3 Data Preparation: Training and Testing Samples
   4.4 Algorithms and Architectures
      4.4.1 Algo1: An End-to-End Approach
      4.4.2 Algo2: Introducing More Information in Training
      4.4.3 Algo3: Forcing the Logic
   4.5 Computational Validation
   4.6 Results and Discussion
      4.6.1 Conclusions
List of Figures
List of Tables
References


1 Introduction

Face recognition is useful in many applications, ranging from home leisure uses, such as image tagging in social networks, all the way to public security and border control. Therefore, face recognition has been studied extensively. The earliest studies on automatic machine recognition of faces were published in the 1970s [Kelly, 1970] [Kanade, 1977] (see a review in [Bhele and Mankar, 2012]).

Artificial Neural Networks (ANNs) enjoyed increased popularity in the last decade, fueled by

progress in Convolutional Neural Networks (CNNs), [LeCun et al., 1995], together with better

and faster computer hardware. This progress enabled far-reaching achievements in computer

vision, including face recognition. For example, the quality of face classification from a given

data-set using CNNs has surpassed human performance [Schroff et al., 2015] [Sun et al., 2015]

[Taigman et al., 2014].

A related face recognition task is face verification: the recognition of faces of people not included in the training set.

In this thesis, we address the problem of single sample face verification. Single sample refers

to providing only a single reference image during test time of the algorithm. A scenario such

as this, where the learning algorithm aims to learn information about object categories from

one or only a few images without feedback, is also called one-shot learning. This scenario is

often encountered in real life. For example, a new member joins a social network, a picture of a

criminal, yet unknown to the police, is given, and so on. Moreover, it is not reasonable (today)

to keep and search a database of all the people in the world (7.5 × 10^9 ≈ 2^33), or even of a single country (China and India each have a population over a billion, 10^9 ≈ 2^30). Not to mention training

on such data, as the most successful face recognition algorithms require training on hundreds

or thousands of images per individual. For example in [Taigman et al., 2014], the classification

algorithm was trained on a data-set of over 4.4 million facial images belonging to more than

4,000 identities, an average of more than a thousand different images per individual.

Our approach to the task of single sample face verification is to train a decision maker: an

ANN that receives two input images, performs computation on both images, and outputs

a binary decision, “the same person” or “different persons”. To simplify our task, we focus on

face images having a uniform resolution, such that the face is at the center of the images. We

also narrow the image differences to be of illumination changes, as in the Extended Yale Face

Database B (YaleB). We train multiple ANNs, which share the same architecture, to recognize

whether two face images in different preset illuminations are of the same individual.

In the second part of the thesis, we address a more complicated prediction task, by asking

"what would a new face image look like?". We show how to synthesize a new face image for a

given individual, under different illumination conditions, in the context of YaleB data-set.

There is a possibility ANN algorithms are already predicting images showing faces under

new illuminations. Consider the face verification experiment described above: We would train

a different network for every pair of illuminations. It is possible that given the first image, the


network predicts what the image of that individual would look like under the second illumination, and then compares it with the second input image to determine whether they depict the same individual. Of course, this kind of process may not occur in image space. Finding the right space would be one of the challenges we face.

Figure 1: The Extended Yale Face Database B. Example of 3 individuals' images in 3 illumination conditions.

Another challenging aspect of the synthesis task is combining regression networks into the

process, and not just classification ones. This raises the issue of nearly-correct results. To

address this problem, we propose an ANN-based validation system for the resulting images.

2 Tools

In this section, we describe the face data-set used in our experiments and define new concepts

regarding it. We also describe the code packages used.

2.1 The YaleB Dataset

We will use the Extended Yale Face Database B (YaleB) [Georghiades et al., 2001], which is

available to download at:

http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html.

The data-set contains 32×32 pixel gray-scale, centered, near-frontal face images, taken under different illumination conditions. The images were taken in lab conditions, where the person

sits still and there is only a single light source in the room. To create the different illuminations

the light source location is changed. The set of locations defines the set of illuminations.

Each individual appears in at most one image per illumination condition. Namely, the num-

ber of images of an individual equals the number of illuminations it appears in. We selected 31

individuals who had (at least) 64 images (different illuminations) each, so that each illumination

is shown in the same number of images.

Super Illuminations

In some cases, we thought it would be useful to have more than one sample image of the same

illumination and individual. We also noticed that some of the illuminations are rather similar

looking, which motivated the idea of grouping them together.

This is done in the following way: We represent each illumination by the average of all

images taken in that illumination (31 images, as the number of individuals). Then, we create


15 illumination clusters by applying the k-means clustering algorithm (Matlab), using Euclidean

distances. Denote the 15 illumination clusters with C1, ..., C15 and their corresponding centroids

with c1, ..., c15. Each cluster of illuminations is considered as one super-illumination.

Still, some illuminations which are located in different clusters may resemble one another.

We wish our set of illuminations to be the ones which are most significantly different from each

other. Therefore we go further and choose a representative illumination for each cluster to be

the illumination which has the greatest average distance from the centroids of all the other

clusters (Eq. 1). Fig. 2 illustrates this process. Fig. 3 demonstrates the resulting representative

illumination.

K_i = argmax_{k_ij ∈ C_i} ( mean_{l ≠ i} ‖k_ij − c_l‖₂ )    (1)
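To make this step concrete, the following is a minimal Python/NumPy sketch of the clustering and representative selection; the thesis implementation used Matlab's k-means, so the function and variable names here (super_illuminations, illum_means) are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# illum_means: one mean image per illumination, flattened to a vector
# (e.g. 64 illuminations x 1024 pixels for 32x32 images) -- illustrative shapes
def super_illuminations(illum_means, n_clusters=15, seed=0):
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(illum_means)
    labels, centroids = km.labels_, km.cluster_centers_

    representatives = {}
    for i in range(n_clusters):
        members = np.where(labels == i)[0]              # illuminations in cluster C_i
        other_centroids = np.delete(centroids, i, axis=0)
        # average distance of each member to the centroids of all *other* clusters (Eq. 1)
        avg_dist = np.linalg.norm(
            illum_means[members][:, None, :] - other_centroids[None, :, :], axis=2
        ).mean(axis=1)
        representatives[i] = members[np.argmax(avg_dist)]   # K_i
    return labels, representatives
```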

Figure 2: An illustration of the representative illumination's relationship with the illumination clusters.


Figure 3: Representative illumination images. Every row shows all images of all individuals in one representative illumination. Every column shows images of the same person in all representative illuminations.

2.2 Code Packages

2.2.1 Artificial Neural Network

The implementation of Artificial Neural Networks (ANN) we used is based on EasyConvNet,

a Convolutional Neural Network (CNN) Matlab code package by Shai Shalev-Shwartz, which can be found at https://github.com/shaisha/EasyConvNet [Shalev-Shwartz, 2014].

2.2.2 SVM

We used the Support Vector Machine (SVM) algorithm on different occasions; at times the implementation of Matlab's SVM (http://www.mathworks.com/help/stats/svmtrain.html), and at times that of LIBSVM [Chang and Lin, 2011].


Figure 4: From [Taigman et al., 2014]. An outline of the DeepFace face classifier architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Colors illustrate feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.

3 Illumination Invariant Face Verification

In this section, we describe the experiment of training a decision maker ANN algorithm for

single sample face verification under change of illumination. First, we formally define the task

at hand. Then we address in detail the problem of an unbalanced training set and present

two data augmentation solutions. We apply the experiment's framework to two types of feature vectors, one high dimensional and one low dimensional, and compare the behavior of the corresponding results, discovering a surprising outcome.

3.1 Related Work

The study that best emphasizes the success neural networks brought to the field of face recog-

nition and face verification is [Taigman et al., 2014]. Employing a deep CNN in their algorithm, and having the largest labeled face training data-set available at the time, they obtained the best results to this day, and even managed to surpass human performance on face verification

(Fig. 5).

Their method is to first train a very good face classifier, and then use the one-before-last

blob (of length 4,096) that serves as a feature vector for the classification, also as the feature

vector for face verification. Given such feature vectors of two images, they simply use the inner

product between the two normalized inputs as a similarity measure.

The face classifier consists of two stages: 1) a 3D frontalization and alignment, using ana-

lytical 3D modeling of the face based on fiducial points, that is used to warp a detected facial

crop to a 3D frontal mode (frontalization) [Berg and Belhumeur, 2012], and 2) a deep CNN

(or DNN) whose architecture is shown in Fig. 4. The training data-set for this process is the SFC data-set, a collection of photos from Facebook of 4.4 million labeled faces from 4,030 people, each with 800 to 1,200 faces. This data-set is unfortunately not accessible to the public, nor is the resulting algorithm (true at the time our experiments were conducted).

As opposed to that approach, of finding the "perfect" feature space, such that


all there is left to do is to calculate a simple inner product, we seek a decision maker for face verification that overcomes the flaws of the representation space of the inputs. And since our data-set of choice is the YaleB data-set, we do not need to apply either frontalization or alignment.

Figure 5: From [Taigman et al., 2014]. ROC curve on the LFW data-set, surpassing human performance.

Figure 6: The Market-1501 data-set for Re-ID [Zheng et al., 2015]. Images in the first two rows are grouped by individual. Images in the third row are "noise".

The field of Re-Identification

Another field dealing with a related task is the field of Re-Identification, or in short Re-ID.

Similarly to face verification, the task here is to decide whether two input images are of the

same individual or not. Except in Re-ID input images are of pedestrians, and often of low

resolution, as would be if the images came from security cameras.

The nature of the samples in this field makes it a harder problem. The variety of differences

between two images can be much greater than in face verification. In addition to illumination

and orientation, Re-ID deals with pose (e.g. front vs. back, standing vs. riding a bicycle),

and the resolution makes it nearly impossible to work with face features, that is if they are


visible at all. Moreover, Re-ID algorithms suffer from the blinding effect of an outstanding

feature, such as length and color of hair, gender, and as most data-sets contain images from the

same day for a single object, the color of clothes. These features are shared by all appearances

of the same individual, but not uniquely.

The difficulty of this task is reflected in the fact that the success rates here are much lower than in face verification. For example, rank-1 accuracy on the Market-1501 data-set [Zheng et al., 2015] (Fig. 6) in recent papers is 66% by [Varior et al., 2016], 48% by [Liu et al., 2016], and 37% by [Wu et al., 2016], compared with the results of [Taigman et al., 2014] in face verification. But much like [Taigman et al., 2014], these studies also concentrate on the representation of the input samples, which we will avoid.

3.2 Task Definition

So how do you recognize a face you do not know?

We begin with two main definitions: Recognition, in this context, is the ability to determine

whether two different face images are of the same person or not. And since we use a machine learning algorithm, an individual is considered unseen, or unknown to the algorithm, if no image of that individual is in the training set.

We define the task formally:

Let our set of individuals be P , and the set of possible illuminations be K. And let there

be some representation space. It could be images or some otherwise chosen (or trained) function of images, but for simplicity, we call it the image space. We denote the image of individual p ∈ P taken under illumination k ∈ K by im_k(p).

We set two illuminations: source illumination s ∈ K and target illumination t ∈ K. Then,

for two input images: a base image, im_s(p), and a query image, im_t(p′), where p, p′ ∈ P, the

task is to determine whether p = p′.

In the language of machine learning, our sample set is the set of all ordered pairs of images taken with the preset illuminations, X = {(im_s(p), im_t(p′)) : p, p′ ∈ P}, and our label set is Y = {±1}, where the true label for an instance (im_s(p), im_t(p′)) is 1 if and only if p = p′.

As described before, we have 31 different individuals (|P | = 31), each having 64 images in

(slightly) different illuminations, grouped in 15 clusters of significantly different illuminations.

To ensure that the source and target illuminations, s and t, are different not just by index, but

by appearance, we limit their choice set K to the 15 representative illuminations (Sec. 2.1) of the illumination clusters.

3.3 Data Preparations

3.3.1 Training and Testing Samples Definition

We wish to test a one-shot learning scenario. Thus a partition of the sample set into training

and testing sample sets, where the individual appearing in the query image might have also


been seen during training and would therefore be known to the algorithm, is not enough. We propose instead to first partition the individuals, and then infer the training and testing sample sets.

We randomly partition the 31 individuals of the YaleB data-set into two distinct sets, one

for testing and one for training. We assign 20 for training, and the remaining 11 for testing,

and maintain the same partition through any choice of source and target illuminations. Thus

for the test inputs, base or query, not only were none of the given images seen at training time, but no other image of the individuals p and p′ was seen either.

3.3.2 Balancing the Training Sample Set

Let there be some set A of N different items. Then the set of all ordered pairs of items from that set, A × A, has exactly N pairs that are of the same object, which is N/N² = 1/N of all pairs. Hence, a function that labels all pairs as being of different objects, i.e. the constant −1 labeling, will be correct for 1 − 1/N of the samples. In our case, different objects are the different individuals, and N = 20. Therefore the constant negative function will achieve an accuracy of 95%, while in fact it did not learn to distinguish matching from non-matching pairs at all.

To avoid this deviation, and since we train under the assumption of uniform distribution of

the data, we need to modify our training set in a way that emphasizes the importance of positive

instances. Popular ways to modify the training set in such cases of unbalanced classes can be

divided into two main categories: over-sampling, adding instances to the under-represented class; and under-sampling, removing instances from the over-represented class. Due to the already small size of the data set, under-sampling would leave us with too small a training set to train on. We therefore choose two forms of over-sampling, train separately under each of them and under no augmentation at all, and compare the results:

Duplication: We duplicate positive instances multiple times, so that there are as many positive instances as negative ones.

Noisy Augmentation: We recall that both illuminations s and t are representative

illuminations. Thus we can augment positive samples with similar-looking images. Meaning that for every positive instance (im_s(p), im_t(p)), we add to the training set all pairs of the form (im_s′(p), im_t′(p)), where s′ is an illumination from s's illumination cluster and t′ is from t's. Note that in this case the classes

of same and different pairs of images, are not perfectly equally represented, as

illumination clusters are not equal in size. Nevertheless the ratio is much closer

than in the original state, and a counting shows that for the YaleB data-set, for

any s and t, it is not worse than a 2 : 3 ratio.
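The following is a small, illustrative Python sketch of the pair construction and the two over-sampling schemes described above; the actual experiments were run in Matlab, and the helpers here (make_pairs, cluster_of) are hypothetical, with images referenced by (person, illumination) keys rather than pixel arrays.

```python
import itertools

def make_pairs(train_ids, s, t):
    # all ordered pairs of training individuals, imaged under source s and target t
    return [((p, s), (q, t), +1 if p == q else -1)
            for p, q in itertools.product(train_ids, repeat=2)]

def duplicate_positives(pairs):
    pos = [x for x in pairs if x[2] == +1]
    neg = [x for x in pairs if x[2] == -1]
    # replicate positives until both classes are (roughly) the same size
    return neg + pos * (len(neg) // len(pos))

def noisy_augment(pairs, s, t, cluster_of):
    # cluster_of maps a representative illumination to all illuminations in its cluster
    out = list(pairs)
    for (p, _), (q, _), y in pairs:
        if y == +1:
            # add every combination of illuminations from s's and t's clusters
            for s2 in cluster_of[s]:
                for t2 in cluster_of[t]:
                    out.append(((p, s2), (q, t2), +1))
    return out
```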

As the test sample is also unbalanced, we measure success by accuracy and also by additional

measures that are more often used for unbalanced data (see results section).


Figure 7: Feature vector construction: CNN architecture.

3.4 Face Verification Algorithms

We construct a two-stage algorithm: the first stage is feature vector construction for a single image, using a classification CNN, and the second is a decision-making CNN for two such feature vectors.

3.4.1 Feature Vector Construction

First of all, it is worth noting that our attempts to train a single neural network that takes raw images and outputs a decision did not converge; perhaps too difficult a task considering the small number of instances in the data set. Moreover, we hold additional information that is not

being used with that approach, and that is the identity of the individuals in the training set.

These reasons motivated us both to look for a more suited representation space, and to use a

classification CNN to do that. The feature constructing network’s architecture is described in

Fig. 7.

We use all images of the training individuals group, 20 individuals over 64 illuminations, as

samples for the training. And for the loss function we use multi-class logistic loss (Eq. 2), as

proposed by [Shalev-Shwartz, 2014],

l(o, Y) = Σ_{i : Y_i = 1} −log( e^{o_i} / Σ_j e^{o_j} ) = Σ_{i : Y_i = 1} log( Σ_j e^{o_j − o_i} )    (2)

where o ∈ R^k is the prediction and Y ∈ {±1}^k is the label.
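As a sanity check of the formula, here is a minimal NumPy sketch of Eq. 2; the thesis used the EasyConvNet Matlab implementation, and this stand-alone function is illustrative only.

```python
import numpy as np

def multiclass_logistic_loss(o, Y):
    """o: (k,) score vector; Y: (k,) labels in {-1, +1}, with +1 marking the true class."""
    loss = 0.0
    for i in np.where(Y == 1)[0]:
        # log(sum_j exp(o_j - o_i)), the right-hand side of Eq. 2
        loss += np.log(np.sum(np.exp(o - o[i])))
    return loss
```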

We will compare the results of using two types of feature vectors extracted from the trained network: the low-dimensional score (labels) vector, and the one-before-last blob (OBL for short), of high dimension (1024). The use of the one-before-last blob is the more obvious choice, as it is in fact the feature vector for labeling in the classification task itself, and should be more informative in that way. The use of the score vector is motivated by the intuition that we, humans, often identify a stranger by their resemblance to other familiar people ("Alice looks a

lot like Bob, but also like Charlie”).


3.4.2 Pair Recognition

For the second stage we train multiple networks, one for every choice of feature vector and a

pair of source and target illuminations chosen from the set of representative illuminations.

All networks share the same architecture, described in Fig. 8.

Figure 8: Pair recognition: ANN architecture.

For every session of training we use the (feature vectors of) image pairs such that the

individuals depicted in them are from the training individuals group, and the illuminations

are in a predefined order. For the loss function we use the binary logistic loss (Eq. 3) by

[Shalev-Shwartz, 2014],

l(o, y) = log(1 + e^{−y·o})    (3)

where o ∈ R is the prediction and y ∈ {±1} is the label.

3.5 Results and Discussion

3.5.1 Evaluation of Feature Space

The classification network was trained to classify the 20 individuals on the training set with

a 100% accuracy (all training and no testing). As the length of the score vector equals the

number of individuals (20), and we trained the network to perform under the operation of

max(·), there is no point in measuring the classification accuracy over the test sample. To assess

the expressiveness of the resulting feature spaces (score, OBL) on face images of individuals

that the network was not trained on, we check the separability of the test sample images. We

use multi-class linear SVM to do so. For each feature space, we repeat the experiment with

three different types of labeling: face labels, illumination labels, and super illumination labels.

It is evident from the results (Table 1), that the test samples are easily (linearly) separable in

both feature spaces, and also that the super illuminations division still applies.
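One plausible way to reproduce this separability check is sketched below with scikit-learn's linear SVM rather than the Matlab/LIBSVM implementations actually used; the cross-validation protocol and the names are assumptions made for illustration.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# X: feature vectors (score or OBL) of the 11 test individuals' images
# y: one of the three label types (identity, illumination, or super-illumination)
def separability_accuracy(X, y, folds=5):
    clf = LinearSVC()                          # multi-class linear SVM (one-vs-rest)
    return cross_val_score(clf, X, y, cv=folds).mean()

# Repeating with the three label types reproduces the three columns of Table 1.
```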

So, if the SVM depicts the different test sample identities so well, why bother training a neural network? one might wonder. But this is actually not possible, since this SVM's results involve training over the test individuals themselves, and we are looking for a method that will operate well on a new individual (on which we cannot train at all).


Feature vector type | ID    | 64 original illuminations | 15 super illuminations
Score               | 99.9% | 98%                       | 96.4%
OBL                 | 99.7% | 100%                      | 100%

Table 1: Accuracy values for separability of the test sample in different feature spaces using multi-class SVM.

Note that while these characteristics show that the feature spaces are both relevant for our

experiments, they are somewhat surprising. Since we used the network to classify identities

only, and not illuminations, we expect the outcome to be illumination invariant, which it is

clearly not, both for the score vectors and the OBL vectors.

3.5.2 Evaluation of Pair Recognition

As mentioned earlier, for every choice of source and target illumination, we (train and) test the

networks separately. The results in Table 2 are average results of all these repetitions.

feature space | data augmentation | training accuracy | test TPR | test TNR | test F1 score | test EER
score | none        | 100%               | 28.4% | 98.2% | 34.8% | 21.2%
score | duplication | 100%               | 37.3% | 96.4% | 40%   | 21.1%
score | noisy       | 100%               | 58.8% | 87.2% | 41%   | 23.6%
OBL   | none        | 99.7%              | 25.8% | 95.6% | 28.2% | 26.7%
OBL   | duplication | (did not converge) | -     | -     | -     | -
OBL   | noisy       | 97.4%              | 32.7% | 78.1% | 17.1% | 37.9%

Table 2: Experiment results by type of feature space and data augmentation. TPR = true positive rate, TNR = true negative rate, EER = equal error rate. The EER measure is calculated with respect to the continuous value, prior to the sign function.

First of all, we note that in all the different experiments, the networks indeed learned to

tell positive and negative samples apart, even if not to a full degree. Recall that a constant

negative function would have a 95% training accuracy, and observe that in all cases our training

accuracy values are higher, and for some experiments even reach 100%. Moreover, the test TPR (= TP/(TP+FN)) is strictly positive, which means TP is strictly positive. Namely, positive samples were labeled correctly too. Still, the difference between the high TNR values and the low TPR values teaches us that the relatively more common mistake is to label a positive sample as negative (FN vs TP), rather than the other way around (FP vs TN).

Second, but perhaps a more obvious observation from the table, is the wide gap, in all

cases, between the training accuracy values, which are 100% or nearly that, and the test


measurement values, which are far lower. This tells us that our models fit the training sample much better than the testing one. In other words, we have an overfitting situation. A possible and

common explanation for it is the incompatibility between the number of parameters in the

algorithm and the number of training samples. Recall we trained for every pair of illuminations

separately, which effectively means we have multiplied the number of networks, and hence that

of parameters, by the number of pairs of illuminations. Making the number of parameters

orders of magnitude larger than the number of training samples. Continuing to analyze the

results, we need to keep in mind the overfitting effect.

Comparing the different experiments (Table 2), we see that the F1 score (a measure of a

test’s accuracy that can be interpreted as a weighted average of the precision and recall) results

are altogether higher when using the score vectors as feature vectors than when using the OBL

vectors. Possibly the low dimension of the score vector (20) is somewhat of a regularization,

and therefore the OBL vectors are more overfitted to the training sample.

As for data augmentation, looking at the test TPR it is clear that data augmentation

indeed helped emphasize the importance of positive labels. For every input type, training the network with augmented positive data resulted in better TPR, as expected, with the noisy augmentation, which adds more information on top of balancing the data, performing better than plain duplication. The combination of these two beneficial conditions, a feature vector of low dimension and data augmentation with extra information, led to the best F1 score and TPR, and nearly the lowest EER.
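For reference, a small NumPy sketch of how the measures in Table 2 can be computed from the networks' continuous outputs; the EER threshold search shown here is one common way to realize the definition in the table caption, not necessarily the exact procedure used in the thesis.

```python
import numpy as np

def pair_metrics(y_true, y_score):
    """y_true in {-1, +1}; y_score is the continuous output (its sign is the predicted label)."""
    y_pred = np.sign(y_score)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))

    tpr = tp / (tp + fn)                                   # true positive rate (recall)
    tnr = tn / (tn + fp)                                   # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0

    # EER: threshold on the continuous score where FPR and FNR are closest to equal
    thresholds = np.sort(np.unique(y_score))
    fpr = np.array([np.mean(y_score[y_true == -1] >= t) for t in thresholds])
    fnr = np.array([np.mean(y_score[y_true == 1] < t) for t in thresholds])
    idx = np.argmin(np.abs(fpr - fnr))
    eer = (fpr[idx] + fnr[idx]) / 2
    return tpr, tnr, f1, eer
```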

method  | data augmentation | training accuracy | test TPR | test TNR | test F1 score | test EER
our ANN | duplication | 100%  | 37.3% | 96.4% | 40%   | 21.1%
our ANN | noisy       | 100%  | 58.8% | 87.2% | 41%   | 23.6%
RBF SVM | duplication | 85.3% | 78.3% | 71.1% | 37.1% | 24.1%
RBF SVM | noisy       | 82%   | 80.1% | 66.4% | 34.2% | 24.4%

Table 3: Comparing the results of our ANN and RBF SVM for the score vector feature space and different data augmentation types.

In Table 3, we compare the results under the two best experiment frameworks, score vector

plus noisy data augmentation and score vector plus duplicated positive data, when applying an

off-the-shelf non-ANN ML method: Support Vector Machine (SVM) with Radial Basis Function

(RBF) kernel. The inputs are the same as for the ANN, concatenated pairs of feature vectors,

and the labels are in {±1}. The training accuracy alone tells us that the task is too complex for the SVM to express. And while the SVM seems to fit the testing sample similarly to the training one (training accuracy vs test TPR), the more inclusive measurement of the F1 score shows that the ANN algorithm still gets better results for both data augmentation types, thus


reflecting that the decrease in test TNR is more significant than the test TPR increase, for the

SVM experiments.

3.5.3 Conclusions

The most interesting outcome from the results is the feature space behavior. First, it is clear that

both feature spaces are not illumination invariant, in spite of the fact that they were extracted from a face-only classification network. And second, in other studies where the feature space is defined by a classification network, the feature vectors are never (to our knowledge) defined by the score vector itself, but are the outcome of one of the hidden layers, much like in the previously mentioned [Taigman et al., 2014]. Reasons for that include the fact that the label vector's length is determined by the number of classes and is not flexible like the dimension of the feature vector for that very same labeling. Moreover, one might think it would be invariant to characteristics other than the ones it is meant to classify, which evidently it is not.

The surprising success of the label vector as a feature vector in our experiments, as well as the lack of other studies using it, raises some questions: Is the success coincidental? Can it be repeated

with other data-sets? Is it scalable, or is it working well due to its low dimension only? Can

it be applied to class-based classification (like types of cars, or animals), or only object (ID)

based classification? What other characteristics are preserved?


4 Synthesis of Face Images Under New Illuminations

In this section, we present our work on ANN algorithms for synthesis of face images showing new illumination conditions. The common approach to this task includes a human-engineered model that is invariant to the property in question (illumination or other). Our goal is to fully exploit the power of ANNs on this task, and operate without such a model. This opens the door to endless types of ANN architecture. We characterize the logic structure of such a synthesis algorithm and propose a number of architectures.

Another challenging aspect of the experiment is the one of computational evaluation of the

synthesized images. We propose an ANN classifier, trained to evaluate key properties in the

resulting images.

4.1 Related Work

In [Riklin-Raviv and Shashua, 1999] a synthesis task very similar to ours is approached. They too attempt to synthesize a new face image under given different illumination conditions. Based on earlier research by the author about the low dimensionality of the image space under varying lighting conditions, originally reported in [Shashua, 1992], [Shashua, 1997] for the case of Lambertian objects, which the natural human face is, this paper takes the strategy of a well-defined illumination-invariant signature image, called the "quotient" image. Given two objects (a, b), the quotient image Q is defined by the ratio of their albedo (surface texture) functions.

Moreover, results in the paper rely on an ideal class assumption: an ideal class is a collection of

3D objects that have the same shape but differ in the surface albedo function. While the human-

face class is definitely not an ideal class, the researchers treat it as one, claiming that performing

pixel-wise dense correspondence between images (like limiting to frontal images) satisfies the ideal

class conditions. In the paper results are demonstrated on the high quality database prepared

by Thomas Vetter and his associates [Vetter et al., 1997] [Vetter and Poggio, 1996]. In our

experiments we specifically want to test on images that have not been processed in that way, to better simulate

natural conditions.

One of our goals is to see the power of ANNs. [Riklin-Raviv and Shashua, 1999] shows success in our task prior to the age of neural networks. A more recent paper is [Kulkarni et al., 2015]. This work, much like [Riklin-Raviv and Shashua, 1999], is focused on defining an invariant representation, as an intermediate step, for the problem of face image synthesis under new conditions, except that they do it using ANNs. This study aims to learn an interpretable representation of images, disentangled with respect to three-dimensional scene structure and viewing: a graphics code for complex transformations such as out-of-plane rotations, lighting variations, pose, and shape.

Figure 9: Results of the Quotient Image algorithm: original images under 3 distinct lighting conditions, and the synthesized images using an N = 10 bootstrap set.

Figure 10: From [Kulkarni et al., 2015]: Structure of the representation vector. φ is the azimuth of the face, α is the elevation of the face with respect to the camera, and φL is the azimuth of the light source.

Inspired by the various work done in the field of representation learning [Bengio et al., 2013] [Cohen and Welling, 2014] [Goodfellow et al., 2009], this work proposes a representation that upholds the following principles: invariance, interpretability,

abstraction, and disentanglement. A disentangled representation is one for which changes in

the encoded data are sparse over real-world transformations.

The graphics code referred to is strictly defined for every property (Fig. 10), which simulates a parametric representation of the represented items: a manually designed feature vector. In our work we strictly avoid that direction in particular, as we avoid any intermediate invariant representation.

They proposed the Deep Convolutional Inverse Graphics Network (DC-IGN) model, which, as shown in Fig. 11, consists of two parts: an encoder network which captures a distribution over graphics codes Z given data x, and a decoder network which learns a conditional distribution to produce an approximation x̂ given Z.

One more major difference between this work and ours (as with [Riklin-Raviv and Shashua, 1999]) is the nature of the data-set. Here they use a synthetic data-set, while we set out to address real-world images. The data-set is of faces generated from a 3D face model obtained from [Paysan et al., 2009], consisting of faces with random variations in face identity variables (shape/texture), pose, or lighting.

A few example results are shown in Fig. 12. Despite the impressive comparison with


the "normally-trained network", the results are not perfect, and although the transformation of pose is very convincing, it is not entirely clear that the identity is preserved in the synthesized images.

Figure 11: Algorithm of [Kulkarni et al., 2015]: the Deep Convolutional Inverse Graphics Network (DC-IGN) has an encoder and a decoder.

4.2 Task Definition: New Illumination Synthesis

We wish, then, to synthesize a face image of a given individual in new illumination conditions. We would "give" the identity by providing an image of that individual, in different illumination conditions of course. As in the previous chapter, let our set of individuals be P, and the set of possible illuminations be K. We denote the image of face p ∈ P under illumination k ∈ K by im_k(p). We choose and set a target illumination k∗ ∈ K, and our task is therefore the following:

Input: A face image of individual p ∈ P in some source illumination k ∈ K, im_k(p), where k ≠ k∗. This is the query image.

Output: A face image of the same individual p in the target illumination k∗, im_k∗(p). The target image will be (the real) im_k∗(p).

4.3 Data Preparation: Training and Testing Samples

To focus on the task of synthesis, we want to eliminate any unnecessary challenges. Namely,

differences between the training and testing sets are kept minimal. Therefore, unlike in previous experiments, we allow the individuals in the test sample to appear in the training sample set, but with different images, so the samples do not overlap. Also, since we preset the target illumination, it has to appear in both the test and training samples. For simplicity, and to ensure that none of the training images resembles the testing ones too closely, we use the 15 representative illuminations

only. Under this framework, we specifically define our samples as follows:


Figure 12: From [Kulkarni et al., 2015]: Entangled versus disentangled representations. First column: original images. Second column: transformed image using DC-IGN. Third column: transformed image using a normally-trained network.

Figure 13: The unseen-sample images.

We set aside a set of images that we will attempt to create at test time. Thus they are all

in target illumination k∗. These images will never be seen during training. Therefore we call it

the unseen-sample. To still allow training with target illumination we limit the unseen-sample

to contain images of only 3 individuals; let us denote them p1, p2, p3. Then, the unseen-sample

is as in Eq. 4.

{ im_k∗(p) : p ∈ {p1, p2, p3} }    (4)

Fig. 13 shows the unseen-sample images. Overall, the unseen-sample contains exactly 3

images (3 faces in 1 illumination). All other images can be seen during training.

We define the query-test-sample to be the set of all query image - target image pairs, whose

target images are in the unseen-sample. Thus there are only 3 individuals in this set, as defined

in Eq. 5. Specifically we have 14 query images for every target image in the unseen-sample,

that is one query for every valid source illumination. Note that we allow the query images to

be seen during training.

{ (im_k(p), im_k∗(p)) : k ≠ k∗, p ∈ {p1, p2, p3} }    (5)


On the opposite side, we define a query-training-sample to contain all the query image -

target image pairs whose target images are not in the unseen-sample, as in Eq. 6. This means

that the query samples are samples of disjoint groups of individuals.

{ (im_k(p), im_k∗(p)) : k ≠ k∗, p ∈ P \ {p1, p2, p3} }    (6)

Some of the algorithms listed below use a separately trained stage for feature vector construction. For this stage we used all images that are not in the unseen-sample, including the query images of the query-test-sample, as the training sample.
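A short illustrative sketch of how the three sample sets of Eqs. 4-6 can be enumerated is given below; images are referenced here by (person, illumination) keys, and the function name is hypothetical.

```python
def build_samples(P, K, k_star, unseen_people):          # unseen_people = {p1, p2, p3}
    unseen_sample = {(p, k_star) for p in unseen_people}                      # Eq. 4

    query_test_sample = [((p, k), (p, k_star))                                # Eq. 5
                         for p in unseen_people
                         for k in K if k != k_star]

    query_training_sample = [((p, k), (p, k_star))                            # Eq. 6
                             for p in P if p not in unseen_people
                             for k in K if k != k_star]
    return unseen_sample, query_test_sample, query_training_sample
```

With the 15 representative illuminations and 31 individuals, this yields 3 × 14 = 42 query-test pairs and 28 × 14 = 392 query-training pairs, matching Eq. 13 below.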

4.4 Algorithms and Architectures

Designing our algorithms, we couldn’t ignore the fact that many other studies (such as

[Riklin-Raviv and Shashua, 1999] and [Kulkarni et al., 2015] mentioned above), struggled to

find a feature space better suited to the task than image space. But, since we wish to test our results in image space, by looking at them, our algorithms need to convert their results back, from the space they work in, to images. In other words, an algorithm needs to cover three main logic stages:

1. Constructing the feature space: im_k(p) → f(im_k(p)).

2. Performing the illumination conversion: f(im_k(p)) → f(im_k∗(p)).

3. Converting the result back to image space: f(im_k∗(p)) → im_k∗(p).

However, it is not clear that we should force these stages in our algorithm design, as these stages might be implicitly covered within the ANNs. That said, taking the road of an end-to-end single ANN design makes the task we are addressing harder, by demanding synthesis of yet unseen individuals. Since we are effectively making the training and testing samples the training and testing query samples, respectively, taking the end-to-end road means training and testing on distinct groups of individuals.

We therefore define, and compare, three different ANN-based algorithm designs (Fig. 14):

4.4.1 Algo1: An End-to-End Approach

This algorithm contains a single neural network (ANN1, Fig. 15), which performs the entire

task. In its structure there are no constraints on feature space or the logic structure above

at all. The network receives an image, and outputs an image, while encapsulating the feature

space in which the transformation takes place. Therefore we say the illumination conversion is

done in image space.

We use the query-training-sample for ANN1's training, which, as mentioned earlier, means it is not trained over any of the unseen-sample's individuals. ANN1 was trained using the Huber loss function (Eq. 7),

L_δ(y, f(x)) =
  (1/2) · mean_{(x,y) ∈ batch}( ‖y − f(x)‖₂² )      if max_{(x,y) ∈ batch} ‖y − f(x)‖₁ ≤ δ
  δ · max_{(x,y) ∈ batch} ‖y − f(x)‖₁ − (1/2)δ²      otherwise    (7)


Figure 14: Image synthesis algorithms logic design.

Figure 15: ANN1 architecture.

where x is the query image, y is the target image, f(x) is the synthesized image, and δ = 0.001. It approximates the Mean Squared Error (MSE), which is commonly used to evaluate image regression. The general form of the Huber loss function, as proposed by [Shalev-Shwartz, 2014], is given in Eq. 8.

L_δ(y, f(x)) =
  (1/2)(y − f(x))²              if |y − f(x)| ≤ δ
  δ·|y − f(x)| − (1/2)δ²        otherwise    (8)
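For concreteness, a minimal NumPy sketch of the element-wise Huber loss of Eq. 8, averaged over a batch, is shown below; this is illustrative only and not the EasyConvNet implementation.

```python
import numpy as np

def huber_loss(y, f_x, delta=0.001):
    err = np.abs(y - f_x)
    quadratic = 0.5 * err ** 2                 # used where |y - f(x)| <= delta
    linear = delta * err - 0.5 * delta ** 2    # used where the error is larger
    return np.mean(np.where(err <= delta, quadratic, linear))
```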

4.4.2 Algo2: Introducing More Information in Training

To allow training for all individuals, and to examine illumination conversion in a different

feature space, we add a classification network to the process. Algo2 is composed of two parts,

two ANNs: a feature construction stage, and a combined illumination conversion and image

synthesis stage.


Figure 16: ANN2.1 architecture.

Figure 17: ANN2.2 architecture.

The feature construction stage (ANN2.1, Fig. 16) is a classification network for both ID

and illumination. This choice of task for the stage allows us not only to train over all training images (not just the ones in the query-training-sample), but also to use the information of ID

and illumination labels.

ANN2.1 was trained with multi-class logistic loss function (Eq. 2).

Although the second stage (ANN2.2, Fig. 17) is of a similar nature to ANN1 in Algo1, having to perform illumination conversion and output an image, its task is more complicated in a way. Being forced to use yet another input feature space compels the network to learn the transformation between these spaces in addition to the illumination conversion.

ANN2.2 is trained over the query-training-sample, where the query image of every query-target pair in that sample is given as its feature vector (denoted ANN2.1(im_k(p)) = f(im_k(p))), as in Eq. 9.

{ (f(im_k(p)), im_k∗(p)) : k ≠ k∗, p ∈ P \ {p1, p2, p3} }    (9)

The loss function for ANN2.2 is the Huber loss (Eq. 8), with parameter δ = 0.001.

4.4.3 Algo3: Forcing the Logic

To go further and simplify each network's assignment, we propose Algo3, where we force all the logic stages by training a different network for each stage. The advantage of this approach is


Figure 18: ANN3.2 architecture.

Figure 19: ANN3.3 architecture.

that it simplifies the task of the networks: one for feature construction, one for illumination conversion, and one for the inverse space transformation (from feature vectors back to images). It also allows us to integrate information of ID and illumination, and images of the individuals of the unseen-sample, not only in the first stage as in Algo2, but in the last one as well.

The feature constructing network (ANN3.1) is the same one as in Algo2 (ANN2.1 = ANN3.1).

The second stage network (ANN3.2, Fig. 18), having the task of illumination conversion (but not the synthesis), is trained with feature vectors of the query-training-sample (using ANN2.1), both for query and target images (Eq. 10).

{ (f(im_k(p)), f(im_k∗(p))) : k ≠ k∗, p ∈ P \ {p1, p2, p3} }    (10)

ANN3.2 was trained with Huber loss function (Eq. 8), with parameter δ = 0.001.

The third stage, being performed by ANN3.3 (Fig. 19), is basically a transformation back

to image space. Its training sample is constructed with the help of ANN2.1, and contains all

images that are not in the unseen-sample. The input instances are feature vectors of images, f(im_k(p)), and the labels are the same images themselves, im_k(p) (Eq. 11).

{ (f(im_k(p)), im_k(p)) : k ∈ K, p ∈ P, (k = k∗ → p ∉ {p1, p2, p3}) }    (11)

ANN3.3 was trained with the Huber loss function (Eq. 8), with parameter δ = 0.0001.
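Schematically, Algo3 chains the three trained networks at test time as in the following sketch; the callables ann3_1, ann3_2, ann3_3 are hypothetical stand-ins for the trained stage networks, not part of the thesis code.

```python
def algo3_synthesize(query_image, ann3_1, ann3_2, ann3_3):
    f_query = ann3_1(query_image)     # stage 1: image -> feature vector
    f_target = ann3_2(f_query)        # stage 2: illumination conversion in feature space
    return ann3_3(f_target)           # stage 3: feature vector -> synthesized image
```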


4.5 Computational Validation

Given a synthesized image im_k∗(p), how do you decide if it is "good" enough, or "close" enough to your target image? Since we aim for images that demonstrate a specific individual and illumination condition, using the Mean Squared Error (MSE) from the target image, as is commonly done for image evaluation, is not sufficient here; as in real life, a face in a blurry image can still be recognizable. We propose a computational validation tool, a face and illumination image classification ANN (Fig. 20). We train the validation tool over the entire YaleB data-set, because, much like in classic face recognition tasks, it is used to evaluate new images, in this case the synthesized images, not new individuals. We say an output image, or feature vector, is correct if both its illumination and ID labels, as assigned by the validation network, are correct. A synthesis network's accuracy is determined by this notion.

We also construct one such validation tool for feature vectors (Fig. 21), in order to evaluate

Algo3's performance before returning to image space.

Both networks were trained with multi-class logistic loss function (Eq. 2).

Figure 20: Image validation network architecture.

Figure 21: Feature vector validation network architecture.
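The accuracy notion above can be summarized by the following illustrative sketch, where validation_net is a hypothetical stand-in for the trained validation classifier returning an (ID, illumination) prediction pair.

```python
def synthesis_accuracy(synth_images, true_ids, true_illums, validation_net):
    correct = 0
    for img, pid, ill in zip(synth_images, true_ids, true_illums):
        pred_id, pred_ill = validation_net(img)
        # an image counts as correct only if BOTH labels are recovered
        correct += int(pred_id == pid and pred_ill == ill)
    return correct / len(synth_images)
```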


4.6 Results and Discussion

Overall, and with so little data, all three algorithms were able to produce recognizable results

to some degree. We can see a clear structure of a face, and illumination seems correct. We

give more detail below, with reference to the quantitative results and the enclosed figures:

        | Training accuracy | Illumination test accuracy | ID test accuracy | Test accuracy
Algo1   | 99.5% | 100% | 35.7% | 35.7%
Algo2   | 100%  | 100% | 19.1% | 19.1%
Algo3   | 100%  | 100% | 23.8% | 23.8%
ANN3.2* | 100%  | 100% | 23.8% | 23.8%

Table 4: Image synthesis algorithms' accuracy values, by our validation tools. Since illumination accuracy is 100%, the overall test accuracy and ID accuracy are equal. *Evaluation of Algo3 in feature vector space.

Looking at accuracy results (Table 4), it is clear that we have a case of overfitting to the

training data. This is probably an outcome of the very big gap between the number of algorithm

parameters and the number of training samples, in every one of the algorithms. While there is

no known formula for the number of instances required to train a neural network, it is somewhat

agreed that it should be around one order of magnitude less than the number of parameters.

Our settings doesn’t come close to that relation. We bring the example of Algo1: in Eq. 12

we calculate the number of parameters in ANN1, and in Eq. 13 the number of samples in the

training set.

20×(5×5+1) + 20×(5×5×20+1) + 2048×(24×24×20+1) + 1024×(2048+1) = 25,703,724    (12)

(31 − 3) × 14 = 392    (13)
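As a quick check of the arithmetic, the terms of Eq. 12 can be evaluated directly; the per-layer interpretation in the comments is read off the terms and the ANN1 architecture, and is an assumption.

```python
conv1 = 20 * (5 * 5 + 1)            # 20 filters of 5x5 on 1 channel, plus biases
conv2 = 20 * (5 * 5 * 20 + 1)       # 20 filters of 5x5 on 20 channels, plus biases
fc1   = 2048 * (24 * 24 * 20 + 1)   # fully connected layer on the flattened 24x24x20 blob
fc2   = 1024 * (2048 + 1)           # fully connected output layer
print(conv1 + conv2 + fc1 + fc2)    # 25703724 parameters (Eq. 12)
print((31 - 3) * 14)                # 392 training samples (Eq. 13)
```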

Another immediate observation (Table 4) is of illumination accuracy: recall that a synthesized

image is considered correct if both its illumination and ID labels are correct. But, since all al-

gorithms achieved illumination accuracy of 100%, a synthesized image is ”correct” if and only if

its validation network’s ID label is correct. Thus, total accuracy values are equal to ID accuracy

values. Looking at the examples in Table 5, we see that in fact even the most blurry results are

demonstrating realistic shadow, and are not just darker on the right side of the picture. That said, all human faces have similar characteristics when it comes to casting a shadow (the eye socket is shadowed, the eyebrow is illuminated). Namely, the task of demonstrating illumination is an easier task to learn than that of maintaining the correct face through illumination changes.

As for ID accuracy, in spite of the seemingly low testing accuracy values (Table 4), the

task is a success. We recall that the validation networks were trained over all individuals (31),

and therefore a random image pool would have achieved an accuracy rate of 1/31 ≈ 3.2%. In


Target image | Correct results                | Incorrect results
ID: 1        | by Algo3                       | by Algo1, by Algo2
ID: 2        | by Algo1, by Algo1             | by Algo3
ID: 3        | by Algo1, by Algo1, by Algo1   | by Algo2

Table 5: Examples of correct and incorrect results for the unseen-sample. A synthesized image is considered "correct" if both its ID and illumination labels, by the validation network, are correct. All resulting images have correct illumination labeling. The incorrect results have incorrect ID labeling.

In other words, our algorithms achieved 6 to 11 times higher accuracy than chance. Moreover, the degradation in the resulting images, correct or incorrect, still looks realistic, unlike the examples in [Kulkarni et al., 2015] (Fig. 12).

Looking at the resulting images, it seems that the accuracy values correlate with the quality of the synthesized images. Algo1, which achieved the highest test accuracy of 35.7%, is also responsible for the best results, in terms of image quality and the ability to recognize the individual by eye, for two out of the three individuals of the unseen sample (IDs 2 and 3; Fig. 23). Following in second place is Algo3, with the best result for the third individual (ID 1), for which neither Algo1 nor Algo2 achieved any correct result. Moreover, although Algo3 has more correct results for individual #3, Algo1's results look a lot more similar to the target image (Fig. 24).

Another conclusion from Table 4 regards Algo3. The accuracy results of Algo3 before and after the transformation of the predicted feature vectors back to images (fourth and third rows, respectively) are the same. This tells us that the space conversion (ANN3.3) did not degrade the overall task.

Recall that Algo1 was trained on the least amount of data, and not on any image of the unseen sample's individuals, so it could not form any "memory" of their faces. Given that its task was harder than the others' in this respect, the fact that Algo1 achieved the best test accuracy and the best-quality synthesized images is surprising.


Figure 22: All correct test results. All synthesized images shown here were correctly classified by the image validation network. The location of each such image indicates the query image it was synthesized for (top of the same column), the real target image for that query (second from the top of the same column), and the algorithm that synthesized it (row). The top row contains only those query images for which at least one of the algorithms synthesized a correctly labeled image.

But does that mean image space is a better space for our task than the one produced by the classification network (ANN2.1)? The answer is complicated. On the one hand, as stated earlier, the only algorithm that uses unprocessed images (Algo1) performed better in general, both in terms of accuracy and in terms of quality. On the other hand, its success is centered mostly on one individual (#2); on both IDs 1 and 3, Algo3 achieved more correct results (Fig. 22). It is possible that individual #2 has distinctive facial features in image space, which, if preserved, lead to a correct labeling, whereas in feature vector space, which is defined with an equal representation for every individual in the data set, his features are no more unique than anybody else's.

What does remain clear is that whichever space is used, it is better to remain within it. The transition in Algo2 between feature space and image space, on top of the illumination conversion performed in ANN2.2, is the most likely reason for its lowest accuracy.


Figure 23: Best looking synthesized images.

Figure 24: Correct results for ID 3, Algo1 vs Algo3 (panels: target image; correct results of Algo1; correct results of Algo3).


4.6.1 Conclusions

While the experiment demonstrates the potential of ANNs to synthesize realistic face images under new illumination conditions, its framework requires some improvements, one of which is the evaluation method.

The way we chose to evaluate our synthesized images is far from perfect. Its shortcomings raise a number of questions and ideas regarding synthesized-image evaluation in particular, and non-trivial evaluation in general:

First of all, it seems that if we wish to evaluate an algorithm in a certain way, we should train it to perform well under that evaluation method. In our experiment, we evaluated the synthesis performance with the Huber loss function (Eq. 7) at training time, and with the validation networks at test time. But if the Huber loss, or MSE, is not the only way we eventually want to evaluate our results, why train that way?
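For reference, a minimal PyTorch sketch of the Huber loss we refer to; this is the standard form with a threshold delta, and the exact constants in Eq. 7 may differ:

```python
import torch

def huber_loss(pred, target, delta=1.0):
    # Quadratic for small residuals, linear for large ones; less sensitive
    # to outlier pixels than plain MSE. delta = 1.0 is an arbitrary choice here.
    residual = torch.abs(pred - target)
    quadratic = 0.5 * residual ** 2
    linear = delta * (residual - 0.5 * delta)
    return torch.where(residual <= delta, quadratic, linear).mean()
```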

If we do want the validation network to be the evaluation criterion, how could we use it for training? One possibility is to fix the validation network's weights and concatenate it with the synthesis network during training. To avoid synthesizing only a few key distinctive features, one can use a combination of the Huber loss and some loss function over the classification labels of the validation network (Fig. 25). The derivatives of the validation network can be calculated in back-propagation form, with the exception that they are taken with respect to its input rather than its weights.

Figure 25: Combining the validation network into training process.
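A minimal PyTorch-style sketch of this idea, assuming the validation network outputs ID logits; the function and variable names and the weighting factor alpha are ours, not the thesis code:

```python
import torch
import torch.nn.functional as F

def train_synthesis(synthesis_net, validation_net, loader, alpha=0.5, epochs=10):
    # Freeze the validation network: gradients then flow through it with
    # respect to its input (the synthesized image), not its weights.
    for p in validation_net.parameters():
        p.requires_grad = False

    optimizer = torch.optim.Adam(synthesis_net.parameters())
    for _ in range(epochs):
        for query_image, target_image, target_id in loader:
            optimizer.zero_grad()
            synthesized = synthesis_net(query_image)
            # Pixel-level term (smooth L1 equals the Huber loss for threshold 1.0).
            pixel_loss = F.smooth_l1_loss(synthesized, target_image)
            # Classification term from the frozen validation network.
            id_loss = F.cross_entropy(validation_net(synthesized), target_id)
            loss = pixel_loss + alpha * id_loss
            loss.backward()
            optimizer.step()
```

The pixel term keeps the synthesis from collapsing onto a few identity-revealing features, while the classification term pushes the output toward the correct ID label.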

Still, this leaves open the question of image quality as judged by eye. Given enough data, we might have been able to train an ANN to evaluate quality and use it in the same way, but obtaining the right data is complicated: adding synthetic noise, blurring, or any other kind of degradation would only teach the learning method to recognize that specific distortion, and not necessarily the one created by the synthesis ANN.

Phrasing our task differently emphasizes that this field is already being explored: we would like to create a real-looking image, given a description of the desired output. In other words, we need an evaluation tool that penalizes unrealistic appearance. This is, of course, the mission of generative adversarial networks (GANs); in [Reed et al., 2016], for example, realistic images are created from a caption alone.
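As an illustration of how such a tool could plug into our setting, the sketch below adds a standard non-saturating adversarial term to the pixel loss, assuming a discriminator that outputs a single realism logit per image; the discriminator and the weighting lambda_pixel are hypothetical additions, and this is generic GAN training rather than the method of [Reed et al., 2016]:

```python
import torch
import torch.nn.functional as F

def generator_step(synthesis_net, discriminator, g_optimizer,
                   query_image, target_image, lambda_pixel=10.0):
    g_optimizer.zero_grad()
    fake = synthesis_net(query_image)
    # Realism term: the synthesis network tries to make the discriminator
    # label its output as real (non-saturating generator loss).
    real_label = torch.ones(fake.size(0), 1)
    realism_loss = F.binary_cross_entropy_with_logits(discriminator(fake), real_label)
    # Pixel term keeps the output close to the target image.
    pixel_loss = F.smooth_l1_loss(fake, target_image)
    loss = realism_loss + lambda_pixel * pixel_loss
    loss.backward()
    g_optimizer.step()
    return loss.item()
```

The discriminator would be trained in alternation on real versus synthesized images, so that unrealistic appearance is penalized by a learned critic rather than by a fixed pixel metric.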


List of Figures

1 Extended Yale Face Database B example . . . . . . . . . . . . . . . . . . . . . . 7

2 Illumination clustering and representatives illustration . . . . . . . . . . . . . . . 8

3 Representative illuminations images . . . . . . . . . . . . . . . . . . . . . . . . . 9

4 Outline of the DeepFace [Taigman et al., 2014] face classifier architecture . . . . . . . . . 10

5 [Taigman et al., 2014] ROC curve on LFW data-set. . . . . . . . . . . . . . . . . 11

6 Market-1501 data-set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

7 Feature vector construction: CNN architecture . . . . . . . . . . . . . . . . . . . 14

8 Pair recognition: ANN architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 15

9 Results of the Quotient Image algorithm [Riklin-Raviv and Shashua, 1999] . . . . 20

10 Structure of the representation vector of [Kulkarni et al., 2015] . . . . . . . . . . 20

11 Algorithm of [Kulkarni et al., 2015] . . . . . . . . . . . . . . . . . . . . . . . . . . 21

12 Entangled versus disentangled representations [Kulkarni et al., 2015] . . . . . . . 22

13 The unseen-sample images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

14 Image synthesis algorithms logic design . . . . . . . . . . . . . . . . . . . . . . . 24

15 ANN1 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

16 ANN2.1 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

17 ANN2.2 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

18 ANN3.2 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

19 ANN3.3 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

20 Image validation network architecture . . . . . . . . . . . . . . . . . . . . . . . . 27

21 Feature vector validation network architecture . . . . . . . . . . . . . . . . . . . . 27

22 All correct test results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

23 Best looking synthesized images . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

24 Correct results for ID 3, Algo1 vs Algo3 . . . . . . . . . . . . . . . . . . . . . . . 31

25 Combining the validation network into training process . . . . . . . . . . . . . . 32


List of Tables

1 Face verification: Separability of feature spaces . . . . . . . . . . . . . . . . . . . 16

2 Face verification: Experiments results by types of feature space and data aug-

mentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Face verification: Comparing the results with RBF SVM . . . . . . . . . . . . . . 17

4 Image synthesis: Algorithms accuracy . . . . . . . . . . . . . . . . . . . . . . . . 28

5 Image synthesis: Examples of correct and incorrect results for the unseen-sample 29


References

[Bengio et al., 2013] Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learn-

ing: A review and new perspectives. IEEE transactions on pattern analysis and machine

intelligence, 35(8):1798–1828.

[Berg and Belhumeur, 2012] Berg, T. and Belhumeur, P. N. (2012). Tom-vs-pete classifiers and

identity-preserving alignment for face verification. In BMVC, volume 2, page 7.

[Bhele and Mankar, 2012] Bhele, S. G. and Mankar, V. (2012). A review paper on face recog-

nition techniques. International Journal of Advanced Research in Computer Engineering &

Technology (IJARCET), 1(8):pp–339.

[Chang and Lin, 2011] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support

vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27.

Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[Cohen and Welling, 2014] Cohen, T. and Welling, M. (2014). Learning the irreducible rep-

resentations of commutative lie groups. In International Conference on Machine Learning,

pages 1755–1763.

[Georghiades et al., 2001] Georghiades, A. S., Belhumeur, P. N., and Kriegman, D. J. (2001).

From few to many: Illumination cone models for face recognition under variable lighting and

pose. IEEE transactions on pattern analysis and machine intelligence, 23(6):643–660.

[Goodfellow et al., 2009] Goodfellow, I., Lee, H., Le, Q. V., Saxe, A., and Ng, A. Y. (2009).

Measuring invariances in deep networks. In Advances in neural information processing sys-

tems, pages 646–654.

[Kanade, 1977] Kanade, T. (1977). Computer recognition of human faces, volume 47.

Birkhauser Basel.

[Kelly, 1970] Kelly, M. D. (1970). Visual identification of people by computer. Technical report, Stanford University, Department of Computer Science.

[Kulkarni et al., 2015] Kulkarni, T. D., Whitney, W. F., Kohli, P., and Tenenbaum, J. (2015).

Deep convolutional inverse graphics network. In Advances in Neural Information Processing

Systems, pages 2539–2547.

[LeCun et al., 1995] LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images,

speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995.

[Liu et al., 2016] Liu, H., Feng, J., Qi, M., Jiang, J., and Yan, S. (2016). End-to-end comparative attention networks for person re-identification. 14(8):1–10.


[Paysan et al., 2009] Paysan, P., Knothe, R., Amberg, B., Romdhani, S., and Vetter, T. (2009).

A 3d face model for pose and illumination invariant face recognition. In Advanced Video and

Signal Based Surveillance, 2009. AVSS’09. Sixth IEEE International Conference on, pages

296–301. IEEE.

[Reed et al., 2016] Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. (2016).

Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.

[Riklin-Raviv and Shashua, 1999] Riklin-Raviv, T. and Shashua, A. (1999). The quotient im-

age: Class based recognition and synthesis under varying illumination conditions. In Com-

puter Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on., vol-

ume 2, pages 566–571. IEEE.

[Schroff et al., 2015] Schroff, F., Kalenichenko, D., and Philbin, J. (2015). Facenet: A unified

embedding for face recognition and clustering. In Proceedings of the IEEE Conference on

Computer Vision and Pattern Recognition, pages 815–823.

[Shalev-Shwartz, 2014] Shalev-Shwartz, S. (2014). A mini tutorial on convolutional-neural-

networks. Technical report.

[Shashua, 1992] Shashua, A. (1992). Illumination and view position in 3d visual recognition.

In Advances in neural information processing systems, pages 404–411.

[Shashua, 1997] Shashua, A. (1997). On photometric issues in 3d visual recognition from a

single 2d image. International Journal of Computer Vision, 21(1):99–122.

[Sun et al., 2015] Sun, Y., Liang, D., Wang, X., and Tang, X. (2015). Deepid3: Face recognition

with very deep neural networks. arXiv preprint arXiv:1502.00873.

[Taigman et al., 2014] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). Deepface:

Closing the gap to human-level performance in face verification. In Proceedings of the IEEE

conference on computer vision and pattern recognition, pages 1701–1708.

[Varior et al., 2016] Varior, R. R., Haloi, M., and Wang, G. (2016). Gated Siamese Convolu-

tional Neural Network Architecture for Human Re-Identification. pages 1–18.

[Vetter et al., 1997] Vetter, T., Jones, M. J., and Poggio, T. (1997). A bootstrapping algorithm

for learning linear models of object classes. In Computer Vision and Pattern Recognition,

1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 40–46. IEEE.

[Vetter and Poggio, 1996] Vetter, T. and Poggio, T. (1996). Image synthesis from a single

example image. Computer Vision—ECCV’96, pages 652–659.

[Wu et al., 2016] Wu, L., Shen, C., and Hengel, A. V. D. (2016). PersonNet : Person Re-

identification with Deep Convolutional Neural Networks. pages 1–7.


[Zheng et al., 2015] Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q. (2015).

Scalable person re-identification: A benchmark. In Computer Vision, IEEE International

Conference on.
