
  • FACE RECOGNITION FROM SURVEILLANCE-QUALITY VIDEO

    A Dissertation

    Submitted to the Graduate School

    of the University of Notre Dame

    in Partial Fulfillment of the Requirements

    for the Degree of

    Doctor of Philosophy

    by

Deborah Thomas

    Kevin W. Bowyer, Co-Director

    Patrick J. Flynn, Co-Director

    Graduate Program in Computer Science and Engineering

    Notre Dame, Indiana

    July 2010

  • © Copyright by Deborah Thomas

    2010

    All Rights Reserved

  • FACE RECOGNITION FROM SURVEILLANCE-QUALITY VIDEO

    Abstract

    by

    Deborah Thomas

In this dissertation, we develop techniques for face recognition from surveillance-quality video. We handle two specific problems that are characteristic of such video, namely uncontrolled face pose changes and poor illumination. We conduct a study that compares face recognition performance using two different types of probe data and acquiring data in two different conditions. We describe approaches to evaluate the face detections found in the video sequence to reduce the probe images to those that contain true detections. We also augment the gallery set using synthetic poses generated using 3D morphable models. We show that we can exploit temporal continuity of video data to improve the reliability of the matching scores across probe frames. Reflected images are used to handle variable illumination conditions to improve recognition over the original images. While there remains room for improvement in the area of face recognition from poor-quality video, we have shown some techniques that help performance significantly.

  • CONTENTS

    FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

    TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

CHAPTER 1: INTRODUCTION . . . 1
1.1 Description of surveillance-quality video . . . 1
1.2 Overview of our work . . . 3
1.3 Organization of the dissertation . . . 5

CHAPTER 2: PREVIOUS WORK . . . 6
2.1 Current evaluations . . . 6
2.2 Pose handling . . . 8
2.3 Illumination handling . . . 13
2.4 Other issues . . . 20
2.5 How this dissertation relates to prior work . . . 28

CHAPTER 3: EXPERIMENTAL SETUP . . . 29
3.1 Sensors . . . 29
3.1.1 Nikon D80 . . . 30
3.1.2 Surveillance camera installed by NDSP . . . 30
3.1.3 Sony IPELA camera . . . 31
3.1.4 Sony HDR Camcorder . . . 31
3.2 Dataset . . . 33
3.2.1 NDSP dataset . . . 33
3.2.2 IPELA dataset . . . 35
3.2.3 Comparison dataset . . . 38
3.3 Software . . . 43
3.3.1 FaceGen Modeller 3.2 . . . 43
3.3.2 IdentityEXPLORER . . . 43
3.3.3 Neurotechnologija . . . 45
3.3.4 PittPatt . . . 45
3.3.5 CSU's preprocessing and PCA software . . . 46
3.4 Performance metrics . . . 47
3.4.1 Rank one recognition rate . . . 47
3.4.2 Equal error rate . . . 47
3.5 Conclusions . . . 49

CHAPTER 4: A STUDY: COMPARING RECOGNITION PERFORMANCE WHEN USING POOR QUALITY DATA . . . 50
4.1 NDSP dataset: Baseline performance . . . 51
4.1.1 Experiments . . . 51
4.1.2 Results . . . 51
4.2 Comparison dataset . . . 53
4.2.1 Experiments . . . 53
4.2.2 Results . . . 54
4.3 Conclusions . . . 65

CHAPTER 5: HANDLING POSE VARIATION IN SURVEILLANCE DATA . . . 66
5.1 Pose handling: Enhanced gallery for multiple poses . . . 66
5.2 Score-level fusion for improved recognition . . . 69
5.2.1 Description of fusion techniques . . . 69
5.3 Experiments . . . 73
5.4 Results . . . 74
5.4.1 NDSP dataset . . . 74
5.4.2 IPELA dataset . . . 80
5.5 Conclusions . . . 83

CHAPTER 6: HANDLING VARIABLE ILLUMINATION IN SURVEILLANCE DATA . . . 85
6.1 Acquisition setup . . . 86
6.2 Reflecting images to handle uneven illumination . . . 86
6.2.1 Averaging images . . . 91
6.3 Comparison approaches . . . 94
6.4 Experiments . . . 96
6.4.1 Test dataset . . . 96
6.4.2 Face detection . . . 98
6.4.3 Experiments . . . 98
6.5 Results . . . 99
6.6 Conclusions . . . 101

CHAPTER 7: OTHER EXPERIMENTS . . . 103
7.1 Face detection evaluation . . . 103
7.1.1 Background subtraction . . . 105
7.1.2 Approach to pick good frames: Gestalt clusters . . . 108
7.1.3 Results: Comparing performance on entire dataset and datasets pruned using background subtraction and gestalt clustering . . . 111
7.2 Distance metrics and number of eigenvectors dropped . . . 116
7.2.1 Experiments . . . 117
7.2.2 Results . . . 117

CHAPTER 8: CONCLUSIONS . . . 119

APPENDIX A: GLOSSARY . . . 121

APPENDIX B: POSE RESULTS . . . 123

APPENDIX C: ILLUMINATION RESULTS . . . 130

BIBLIOGRAPHY . . . 135

  • FIGURES

1.1 Example showing the problem of variable illumination . . . 2

1.2 Example showing the variable pose in two frames of a video clip . . . 3

1.3 Example showing the low resolution of the face in the frame, when the subject is too far from the camera . . . 4

1.4 Example showing the face to be out of view of the camera . . . 4

3.1 Camera to capture gallery data: Nikon D80 . . . 30

3.2 Surveillance camera: NDSP camera . . . 31

3.3 Surveillance camera: Sony IPELA camera . . . 32

3.4 High-definition camcorder: Sony HDR-HC7 . . . 32

3.5 Gallery image acquisition setup . . . 34

3.6 Example frames from the NDSP camera . . . 36

3.7 Example frames from the IPELA camera . . . 37

3.8 Example frames from IPELA camcorder for the Comparison dataset . . . 39

3.9 Example frames from the Sony HDR-HC7 camcorder . . . 40

3.10 FaceGen Modeller 3.2 Interface . . . 44

3.11 Example of CMC curve . . . 48

3.12 Example of ROC curve . . . 49

4.1 Baseline performance for the NDSP dataset . . . 52

4.2 Detections on surveillance video data acquired indoors . . . 57

4.3 Detections on surveillance video data acquired outdoors . . . 58

4.4 Detections on high-definition video data acquired indoors . . . 59

4.5 Detections on high-definition video data acquired outdoors . . . 60

4.6 Results: ROC curve comparing performance when using high-definition and surveillance data (Indoor video) . . . 61

4.7 Results: ROC curve comparing performance when using high-definition and surveillance data (Outdoor video) . . . 62

4.8 Results: CMC curve comparing performance when using high-definition and surveillance data (Indoor video) . . . 63

4.9 Results: CMC curve comparing performance when using high-definition and surveillance data (Outdoor video) . . . 64

5.1 Frames showing the variable pose seen in a video clip (the black dots mark the detected eye locations) . . . 67

5.2 Synthetic gallery poses . . . 68

5.3 Change in rank matrix for a new incoming image . . . 71

5.4 Results: Comparing rank one recognition rates when adding poses of increasing degrees of off-angle poses . . . 75

5.5 Results: Comparing rank one recognition rates when using frontal, +/-6 degree and +/-24 degree poses . . . 76

5.6 Results: Comparing rank one recognition rate when using fusion techniques to improve recognition . . . 78

5.7 Examples of poorly performing images . . . 81

6.1 Setup to acquire probe data and resulting illumination variation on the face . . . 87

6.2 Comparison of gallery and probe images . . . 88

6.3 Reflection algorithm . . . 89

6.4 Example images: original image, reflected left and reflected right . . . 92

6.5 Average intensity of each column . . . 93

6.6 Reflection algorithm . . . 94

6.7 Example images: original image and averaged image . . . 95

6.8 Example images: original image and quotient image . . . 97

7.1 Example eye detections . . . 104

7.2 Structuring element used for erosion and dilation . . . 106

7.3 Example subject: Ground truth and Viisage locations . . . 109

7.4 Results: Rank one recognition rates when using the entire dataset . . . 113

7.5 Results: Rank one recognition rates when using the dataset after background subtraction . . . 114

7.6 Results: Rank one recognition rates when using the dataset after background subtraction and gestalt clustering . . . 115

B.1 CMC curves: Comparing fusion techniques approaches using a single frame . . . 124

B.2 ROC curves: Comparing fusion techniques using a single frame . . . 125

B.3 CMC curves: Comparing approaches exploiting temporal continuity, using rank-based fusion . . . 126

B.4 ROC curves: Comparing fusion techniques exploiting temporal continuity, using rank-based fusion . . . 127

B.5 CMC curves: Comparing fusion techniques exploiting temporal continuity, using score-based fusion . . . 128

B.6 ROC curves: Comparing fusion techniques approaches exploiting temporal continuity, using score-based fusion . . . 129

C.1 CMC curves: Comparing illumination approaches using a single frame . . . 131

C.2 ROC curves: Comparing illumination approaches using a single frame . . . 132

C.3 CMC curves: Comparing illumination approaches exploiting temporal continuity . . . 133

C.4 ROC curves: Comparing illumination approaches exploiting temporal continuity . . . 134

  • TABLES

2.1 PREVIOUS WORK . . . 23

3.1 FEATURES OF CAMERAS USED . . . 33

3.2 SUMMARY OF DATASETS . . . 42

4.1 COMPARISON DATASET RESULTS: DETECTIONS IN VIDEO USING PITTPATT . . . 55

4.2 COMPARISON DATASET RESULTS: COMPARISON OF RECOGNITION RESULTS ACROSS CAMERAS USING PITTPATT . . . 56

5.1 RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION . . . 79

5.2 RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION ON THE IPELA DATASET . . . 82

6.1 RESULTS: COMPARING RESULTS FOR DIFFERENT ILLUMINATION TECHNIQUES . . . 100

7.1 COMPARING DATASET SIZE OF ALL IMAGES TO BACKGROUND SUBTRACTION APPROACH AND GESTALT CLUSTERING APPROACH . . . 112

7.2 RESULTS: PERFORMANCE WHEN VARYING DISTANCE METRICS AND NUMBER OF EIGENVECTORS DROPPED . . . 118

  • CHAPTER 1

    INTRODUCTION

Face recognition from video is an important area of biometrics research today. Most of the existing work focuses on recognition from video where the images are of high resolution, containing faces in a frontal pose, and where the lighting conditions are optimal. However, face recognition from video surveillance has become an increasingly important goal as more and more video surveillance cameras are installed in public places. For example, the Metropolitan Police Department has installed 14 pan, tilt, zoom (PTZ) cameras around the Washington D.C. area [12]. Also, there are 2,397 cameras installed in Manhattan [30]. Face recognition using such video is a very challenging problem because of the low resolution and poor lighting conditions of such video and the presence of uncontrolled movement. In this dissertation, we focus on recognition in the presence of uncontrolled pose and lighting in probe data.

    1.1 Description of surveillance-quality video

We describe surveillance-quality video in terms of four characteristics: (1) variable illumination, (2) variable pose of the subjects in the video, (3) the low resolution of the faces in the video and (4) obstructions of the faces in the video.


  • Figure 1.1. Example showing the problem of variable illumination

Firstly, such video is affected by variable illumination. Oftentimes, surveillance cameras are pointed toward doorways where the sun is streaming in, or the camera may be in a poorly lit location. This can change the intensity of the image, even causing different parts of the image to be illuminated differently, which can cause problems for the recognition system. In Figure 1.1, we show an example frame of such video affected by variable illumination.

The second feature of surveillance video is the variable pose of the subject in the video. The subject is often not looking at the camera, and the camera may be mounted to the ceiling. Therefore, the subject may not be in a frontal pose in the video. While a lot of work has been done using images where the subject is looking directly at the camera, there is a need to explore recognition when the subject is not looking at the camera. In Figure 1.2, we show two such examples.

Figure 1.2. Example showing the variable pose in two frames of a video clip

Another surveillance video characteristic is the low resolution of the face. Usually, such video is of low resolution and covers a large scene. Furthermore, the camera may be located far from the subject. Hence, the subject's face may be small, causing the number of pixels on the subject's face to be low, making it difficult for robust face recognition. In Figure 1.3, we show an image where the subject is too far from the camera for reliable face recognition.

The last feature of surveillance-quality video is obstruction of the face. A perpetrator may be aware of the presence of a camera and try to cover their face to prevent it from being captured. Hats, glasses and makeup can also be used to change the appearance of the face and cause problems for recognition systems. Sometimes, the positioning of the camera may cause the face to be out of view of the camera frame, as seen in Figure 1.4.

Figure 1.3. Example showing the low resolution of the face in the frame, when the subject is too far from the camera

Figure 1.4. Example showing the face to be out of view of the camera

1.2 Overview of our work

In this dissertation, we focus on variable pose and illumination. One theme that we exploit throughout this dissertation is temporal continuity in the surveillance video. One feature of video data that still images lack is the temporal continuity between the frames of the data. The identity of the subject will not

change in an instant, so the multiple frames available can be used for recognition. The matching scores between a pair of probe and gallery subjects can be made more robust by using decisions about a previous frame for the current one. First, we compare recognition performance when using surveillance video to performance when using high-resolution video in our probe dataset. We also devise a technique to evaluate the face detections to prune the dataset to true detections, to improve recognition performance. We use a multi-gallery approach to make the recognition system more robust to variable pose in the data. We generate these poses using synthetic morphable models. We then create reflected images in order to mitigate the effects of variable illumination.

By combining these techniques, we show that we can handle some of the issues of variable pose and illumination in surveillance data and improve recognition over baseline performance.
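As a concrete illustration of the temporal-continuity idea, the following minimal Python sketch (an illustration only, not the implementation used in this dissertation; the array layout, function names, and the assumption that larger scores indicate better matches are ours) keeps a running average of the match scores against each gallery subject as successive frames of a clip arrive, so that no single poorly matched frame determines the final decision.

```python
import numpy as np

def accumulate_scores(frame_scores):
    """Combine per-frame match scores over a video clip.

    frame_scores: list of 1-D arrays, one per frame; entry g is the
    similarity between that frame and gallery subject g (higher = better).
    Returns the running average after each frame and the final best match.
    """
    running = np.zeros_like(frame_scores[0], dtype=float)
    history = []
    for t, scores in enumerate(frame_scores, start=1):
        running += (scores - running) / t   # incremental mean over frames seen so far
        history.append(running.copy())
    best_gallery_id = int(np.argmax(running))
    return history, best_gallery_id

# Toy usage: three frames scored against four gallery subjects.
frames = [np.array([0.2, 0.6, 0.1, 0.3]),
          np.array([0.3, 0.5, 0.2, 0.2]),
          np.array([0.1, 0.7, 0.2, 0.3])]
_, match = accumulate_scores(frames)
print("Predicted gallery subject:", match)
```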

    1.3 Organization of the dissertation

The rest of the dissertation is organized as follows: Chapter 2 describes previous work done in the area. In Chapter 3, we describe the sensors, dataset, and software used in our experiments. We study the effect of poor-quality video on recognition in Chapter 4. Chapters 5 through 7 describe the work we have done in this dissertation. Finally, we end with our conclusions in Chapter 8.


  • CHAPTER 2

    PREVIOUS WORK

    In this chapter, we describe previous work that looks at face recognition from

    unconstrained video. We first describe three studies that explore face recognition

    from video. We look at two problems, namely uncontrolled pose and poor lighting

    conditions. We then describe different approaches that have been used to handle

    both of these problems.

    2.1 Current evaluations

    Three different studies that address face recognition from video are: the FRVT

    2002 Evaluation, FRVT 2006 Evaluation and the Foto-Fahndung report. The

    FRVT 2002 report describes the use of three-dimensional morphable models with

    video and documents the benefits of using them for face recognition. FRVT 2006

    reports on face recognition performance under controlled and uncontrolled light-

    ing. The Foto-Fahndung report describes the face recognition performance of

    three different pieces of software, when the data comes from video acquired by

    a camera looking at an escalator in a German train station. We describe these

    studies in further detail below.

    In the FRVT 2002 Evaluation Report [32], face recognition experiments are

conducted in three new areas (three-dimensional morphable models, normalization, and face recognition from video). The first experiment compares face recognition

    performance when using still images in the probe set to using 100 frames from a

    video sequence, while the subject is talking with varied expression. The video is

    similar to that of a mugshot with the added component of change in expression.

    The gallery is a set of still images. Among all the participants in FRVT 2002,

    except for DreamMIRH and VisionSphere, recognition performance is better when

    using a still image rather than when using a video sequence. They observe that

    if the subject were walking toward the camera, there would be a change in size

    and orientation of the face that would be a further challenge to the system. In

    this work, we focus on uncontrolled video, where data is captured using a surveil-

    lance camera in uncontrolled lighting conditions, hence performance is expected

    to be poor. They also conclude that 3D morphable models provide only slight

    improvement over 2D images.

In 2006, the FRVT 2006 Evaluation Report [34] compared face recognition

    when using 2D and 3D data. It also explores face recognition when using con-

    trolled and uncontrolled lighting. When using 3D data, the algorithms were able

    to meet the FRGC [33] goal of an improvement of an order of magnitude over

    FRVT 2002. To test the effect of lighting, the gallery data was captured in a

    controlled environment, whereas the probe data was captured in an uncontrolled

    lighting environment (either indoors or outdoors). Cognitec, Neven Vision, SAIT

    and Viisage outperformed the best FRGC results achieved, with SAIT having a

    false reject rate between 0.103 and 0.130 at a false accept rate of 0.001. The per-

    formance of FRVT participants when using uncontrolled probe data matches that

    of the FRVT participants of 2002 when using controlled data. However, they also

    show that illumination condition does have a huge effect on performance.


  • The Foto-Fahndung report [9] evaluates performance of three recognition sys-

    tems when the data comes from a surveillance system in a German railway station.

    They report recognition performance in four distinct conditions based on light-

    ing and movement of the subjects and show that while face recognition systems

    can be used in search scenarios, environmental conditions such as lighting and

    quick movements influence performance greatly. They conclude that it is possible

    to recognize people from video, provided the external conditions are right, espe-

    cially lighting. They also state that high recognition performance can be achieved

    indoors, where the light does not change much. However, drastic changes in

lighting conditions affect performance greatly. They state that "High recognition performance can be expected in indoor areas which have non-varying light conditions. Varying light conditions (darkness, black light, direct sunlight) cause a sharp decrease in recognition performance. A successful utilization of biometric face recognition systems in outdoor areas does not seem to be very promising for search purposes at the moment" [9]. They suggest the use of 3D face recognition

    technology as a way to improve performance.

    2.2 Pose handling

Zhao and Chellappa [51] state that researchers have handled rotation problems in three ways: (1) using multiple images per person when they are available, (2) using multiple training images but only one database image per subject when running recognition, and (3) using a single image per subject where no training is required.

    Zhou et al. [54] apply a condensation approach to solve numerically the prob-

lem of face recognition from video. They point out that most surveillance video is of poor quality and low image resolution and has large illumination and pose

    variations. They believe that the posterior probability of the identity of a sub-

    ject varies over time. They use a condensation algorithm that determines the

    transformation of the kinematics in the sequence and the identity simultaneously,

    incorporating two conditions into their model: (1) motion in a short time interval

    depends on the previous interval, along with noise that is time-invariant and (2)

    the identity of the subject in a sequence does not change over time. When they use

    a gallery of 12 still images and 12 video sequences as probes, they achieve 100%

    rank one recognition rate. However, the small size of the dataset may contribute

    to the high accuracy.

    In a later work, they extend this approach to apply to scenarios where the

    illumination of probe videos is different from that of the gallery [21], which is also

    made up of video clips. Each subject is represented as a set of exemplars from

    a video sequence. They use a probabilistic approach to determine the set of the

    images that minimizes the expected distance to a set of exemplar clusters and

assume that in a given clip the identity of the subject does not change; Bayesian probabilities are used over time to determine the identity of the faces in the frames.

    A set of four clips of 24 subjects each walking on a treadmill is used for testing.

    The background is plain and each clip is 300 frames long. They achieve 100%

    rank one recognition rate on all four combinations of clips as probe and gallery.

    Chellappa et al. build on this in [52]. They incorporate temporal information

    in face recognition. They create a model that consists of a state equation, an

    identity equation (containing information about the temporal change of the iden-

    tity) and an observation equation. Using a set of four video clips with 25 subjects

walking on a treadmill (from the MoBo [16] database), they train their model on one or two clips per subject and use the remaining for testing. They are able

    to achieve close to 100% rank one recognition rate overall. They expand on this

    work [53] to incorporate both changes in pose within a video sequence and the

    illumination change between a gallery and probe. They combine their likelihood

    probability between frames over time which improves performance overall. In a

    set of 30 subjects, where the gallery set consists of still images, they achieved 93%

    rank one recognition rate.

    Park and Jain [31] use a view synthesis strategy for face recognition from

    surveillance video, where the poses are mainly non-frontal and the size of the faces

    is small. They use frontal pose images for their gallery, whereas the probe data

    contains variable pose. They propose a factorization method that develops 3D

    face models from 2D images using Structure from Motion (SfM). They select a

    video frame in which the pose of the face is the closest to a frontal pose, as a

    texture model for the 3D face reconstruction. They then use a gradient descent

    method to iteratively fit the 3D shape to the 72 feature points on the 2D image.

    On a set of 197 subjects, they are able to demonstrate a 40% increase in rank one

    recognition performance (from 30% to 70%).

    Blanz and Vetter [10] describe a method to fit an image to a 3D morphable

    model to handle pose changes for face recognition. Using a single image of a per-

    son, they automatically estimate 3D shape, texture and illumination. They use

    intrinsic characteristics of the face that are independent of the external conditions

    to represent each face. In order to create the 3D morphable model, they use a

    database of 3D laser scans that contains 200 subjects from a range of demograph-

    ics. They build a dense point-to-point correspondence between the face model and

a new face using optical flow. Each face is fit to the 3D shape using seven facial feature points (tip of nose, corners of eyes, etc.). They try to minimize the sum of

    squared differences over all color channels from all pixels in the test image to all

    pixels in the synthetic reconstruction. On a set of 68 subjects of the PIE database

    [40], they achieve 95% rank one recognition rate when using the side view gallery.

    Using the FERET set, with 194 subjects, they achieve 96% rank one recognition

    when using the frontal images as gallery and the remaining images as probes.

    Huang et al. [48] use 3D morphable models to handle pose and illumination

    changes in face video. They create 3D face models based on three training images

    per subject and then render 2D synthetic images to be used for face recognition.

    They apply a component-based approach for face detection that uses 14 indepen-

dent component classifiers. The faces are rotated from 0 to 34 degrees in increments of 2 degrees, using two different illuminations. At each instance, an image is saved. Out of

    the 14 components detected, nine are used for face recognition. The recognition

    system consists of second degree polynomial Support Vector Machine classifiers.

    When they use 200 images of six different subjects, they get a true accept rate of

    90% at a false accept rate of 10%.

    Beymer [8] uses a template based approach to represent subjects in the gallery

    when there are pose changes in the data. He first applies a pose estimator based

    on the features of the face (eyes and mouth). Then, using the nose and the eyes,

    the recognition system applies a transform to the input image to align the three

    feature points with a training image. When using 930 images for training the

    detector and 520 images for testing, the features are correctly detected 99.6% of

    the time. For recognition, a feature-level set of systems is used for each eye, nose

and mouth. The probe images are compared only to those gallery images closest in pose. Then he uses a sum of correlations of the best matching eye, nose and mouth templates to determine the best match. On the set of 62 subjects, when

    using 10 images per subject with an inter-ocular distance of about 60 pixels in the

    images, the rank one recognition rate is 98.39%. However, this is a relatively large

    inter-ocular distance for good face recognition and not usually typical of faces in

    surveillance quality video.

    Arandjelovic and Cipolla [3] deal with face movement and observe that most

    strategies use the temporal information of the frames to determine identity. They

    propose a strategy that uses Resistor Average Distance (RAD), which is a measure

    of dissimilarity between two disjoint probabilities. They claim that PCA does not

    capture true modes of variation well and hence a Kernel PCA is used to map the

    data to a high-dimensional space. Then, PCA can be applied to find the true

    variations in the data. For recognition, the RAD between the distributions of

    sets of gallery and probe points is used as a measure of distance. They test their

    approach on two databases. One database contains 35 subjects and the other

    contains 60 subjects. In both datasets, the illumination conditions are the same

    for training and testing. They achieve around 98% rank one recognition rate on

    the larger dataset.

    Thomas et al. [43] use synthetic poses and score-level fusion to improve recog-

    nition when there is variable pose in the data. They show that recognition can

    be improved by exploiting temporal continuity. The gallery dataset consists of

    one high-quality still image per subject. Using the approach in [10] to generate

    synthetic poses, the gallery set is enhanced with multiple images per subject. A

    dataset of 57 subjects is used, which contains subjects walking around a corner in

    a hallway. When they use the original gallery images and treat each probe image

as a single frame with no temporal continuity, they achieve a rank one recognition rate of 6%. However, by adding synthetic poses and exploiting temporal continuity, they improve rank one recognition performance to 21%.
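As a rough illustration of combining an enhanced multi-pose gallery with score-level fusion over frames (a sketch only; the fusion rules actually evaluated in [43] and in Chapter 5 may differ), the Python fragment below takes, for each frame, the best score among a subject's synthetic gallery poses and then sums these per-frame scores across the clip before deciding.

```python
import numpy as np

def fuse_multipose_scores(scores, n_poses):
    """Score-level fusion over an enhanced gallery.

    scores: array of shape (n_frames, n_subjects * n_poses); the columns for
    subject s hold match scores against each of its synthetic gallery poses
    (higher = better).  For each frame keep the best pose per subject, then
    sum over frames so that every frame of the clip contributes.
    """
    n_frames, n_cols = scores.shape
    n_subjects = n_cols // n_poses
    per_subject = scores.reshape(n_frames, n_subjects, n_poses).max(axis=2)
    fused = per_subject.sum(axis=0)          # sum-rule fusion across frames
    return int(np.argmax(fused)), fused

# Toy usage: 2 frames, 3 gallery subjects, 2 synthetic poses each.
s = np.array([[0.1, 0.4, 0.8, 0.2, 0.3, 0.1],
              [0.2, 0.3, 0.6, 0.7, 0.2, 0.2]])
best, _ = fuse_multipose_scores(s, n_poses=2)
print("best match:", best)
```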

    2.3 Illumination handling

    Zhou et al. [55] separate the strategies to handle changes in illumination into

    three categories. The first set of approaches is called subspace methods. These

    approaches are most commonly used in recognition problems. Some common ex-

    amples of this class of approaches are PCA [44] and LDA [50]. However, the

    disadvantage of such techniques is that they are tuned to the illumination condi-

    tions that they are trained on. When the gallery set consists of still images taken

    indoors under controlled lighting conditions and the probe set is of surveillance

    quality video acquired under uncontrolled lighting conditions, recognition perfor-

    mance is poor. The second set of approaches is reflectance model methods. A

    Lambertian reflectance model is used to model lighting. The disadvantage of this

    approach is that it is not as effective an approach when the subjects in the testing

    set are not encountered in the training set. The third set of approaches uses 3D

    models for representation. These models are robust to illumination effects. How-

    ever, they require a sensor that can capture such data or the data needs to be

    built based on 2D images.

    Adini et al. [2] describe four image representations that can be used to han-

    dle illumination changes. They divide the approaches to handle illumination into

    three categories: (1) Gray level information to extract a three-dimensional shape

    of the object (2) A stored model that is relatively insensitive to changes in illumi-

    nation and (3) A set of images of the same object under different illuminations.

The third approach may not be realistic given the experiment and the setup. Furthermore, one may not be able to fully capture all the possible variations in

    the data. While it has been shown theoretically that a function invariant to il-

    lumination does not exist, there are representations that are more robust than

    others [2]. The four representations they consider are (1) the original gray-level

    image, (2) the edge map of the image, (3) the image filtered with 2D Gabor like

    filters and (4) the second-order derivative of the gray level image [2]. Some edges

    of the image can be insensitive to illuminations whereas others are not. However,

    an edge map is useful in that it is a compact representation of the original im-

    age. Derivatives of the gray level image are useful because while ambient light

    will affect the gray level image, under certain conditions it does not affect the

    derivatives. In order to make the images more robust, they divide the face into

    two sub parts by creating subregions of the eyes area and the lower part of the

    face. They show that in highly variable lighting, the error rate is 100% on raw

    gray level images, where there are changes in illumination direction. Performance

    improves when using the filtered images. They also show that even though the

    filtered images do not resemble the original face, they encode information to im-

    prove recognition. However, they conclude that no one representation is sufficient

    to overcome variations in illumination. While some are robust to changes along

    the horizontal axis, others are more robust along the vertical axis. Hence, the

    different approaches need to be combined to exploit the benefits of each of them.

    Zhao and Chellappa [51] use 3D models to handle the problems of illumina-

    tion in face recognition. They create synthesized images acquired under different

    lighting and viewing conditions. They develop a 2D prototype image from a 2D

    image acquired under variable lighting using a generic 3D model, rather than a

full 3D approach that uses accurate 3D information. For the generic 3D model, a laser-scanned range map is used. They use a Lambertian model to estimate the

    albedo value, which they determine using a self-ratio image, which is the illu-

    mination ratio of two differently aligned images. Using a 3D generic model, they

    bypass the 2D to 3D step, since the pose is fixed in their dataset. When they test

    their approach using the Yale database, on a set of 15 subjects with 4 images each

they obtain 100% rank one recognition rate, an improvement of about 25% over using the original images (about 75% rank one recognition rate).

    Wei and Lai [47] describe a robust technique for face recognition under vary-

    ing lighting conditions. They use a relative image gradient feature to represent

    the image, which is the image gradient function of the original intensity image,

    where each pixel is scaled by the maximum intensity of its neighbors. They use

    a normalized correlation of the gradient maps of the probe and gallery images to

    determine how well the images match. On the CMU-PIE face database [40], which

    contains 22 images under varying illuminations of 68 individuals, they obtain an

    equal error rate of 1.47% and show that their approach outperforms recognition

    when using the original intensity images.
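A minimal sketch of such a relative-gradient representation is given below. It follows the textual description (gradient magnitude divided by the largest magnitude in a local neighborhood) rather than the exact formulation in [47]; the window size, the use of SciPy, and the normalized-correlation helper are our own choices.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def relative_gradient(image, window=3, eps=1e-6):
    """Relative image gradient: gradient magnitude at each pixel divided by
    the largest magnitude in its window x window neighborhood, which
    suppresses the influence of the local illumination level."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    local_max = maximum_filter(mag, size=window)
    return mag / (local_max + eps)

def normalized_correlation(a, b):
    """Normalized correlation between two gradient maps, usable as the
    matching score between a probe and a gallery image."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float((a * b).mean())
```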

    Price and Gee [36] also propose a PCA-based approach to address three issues

    that could cause problems in face recognition, namely illumination, expression and

    decoration (specifically, glasses and facial hair). They use an LDA-based approach

    to handle changes in illumination and expression. They note that subregions of

    the face are less sensitive to expression and decoration than the full face. So they

    break the face into modular subregions: the full face, the region of the eyes and

    the nose and then just the eyes. For each region, they independently determine

    the distance from that region to each of the corresponding images in the database.


  • Hence, they have a parallel system of observations, one for each region mentioned

    above. They then use a combination of results as their matching score to determine

    the best match. They use a database of 106 subjects with varied illumination,

    expression and decoration, where 400 still images are used for training and 276

    for testing. When they combine the results from the three observers, using PCA

    and LDA, they achieve a rank one recognition rate of 94.2%.

    Hiremath and Prabhakar [18] use interval-type discriminating features to gen-

    erate illuminant invariant images. They create symbolic faces for each subject

    in each illumination type based on the maximum and minimum value found at

    each pixel for a given dataset. While this is an appearance-based approach, it

    does not suffer the same drawbacks as other approaches because it uses interval

    type features. Therefore, it is insensitive to the particular illumination conditions

    in which the data is captured within the range of illuminations in the training

    data. They then use Factorial Discriminant Analysis to find a suitable subspace

    with optimal separation between the face classes. They test their approach using

    the CMU PIE [40] database and get a 0% error rate. This approach is advanta-

    geous in that it does not require a probability distribution of the image gradient.

    Furthermore, it does not use any complex modeling of reflection components or

    assume a Lambertian model. However, it is limited by the range of illuminations

    found in the training data. Therefore, it may not be applicable in cases where

    there is a difference in the illuminations between the gallery and probe sets.

    Belhumeur et al. [6] use LDA to produce well-separated classes for robustness

    to lighting direction and facial expression, and compare their approach to using

    eigenfaces (PCA) for recognition. They conclude that LDA performs the best

when there are variations in lighting or even simultaneous changes in lighting and expression. They also state that "In the [PCA] method, removing the first three principal components results in better performance under variable lighting conditions" [6]. Their experiments use the Harvard database [17] to test variation

    in lighting. The Harvard database contains 330 images from 5 subjects (66 images

    each). The images are divided into five subsets based on the direction of the light

    source (0, 30, 45, 60, 75 degrees). The Yale database consists of 16 subjects with

    10 images each taken on the same day but with variation in expression, eyewear

    and lighting. They use a nearest neighbor classifier for matching, though the

    measure used to determine distance was not specified. The variation in expression

    and lighting is tested using a leave-one-out error estimation strategy on all 16

subjects. They train the space on nine of the images and then test it using the

    image left out and achieve a 0.6% recognition error rate using LDA and a 19.4%

    recognition error rate using PCA, with the first three dimensions dropped. They

    do mention that the databases are small and more experimentation using larger

    databases is needed.
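The observation about dropping the first principal components is easy to express in code. The sketch below is a generic eigenface construction with the first few components removed; it is not the CSU implementation used elsewhere in this dissertation, and the parameter values are illustrative.

```python
import numpy as np

def pca_subspace(train, n_components=50, n_drop=3):
    """Build an eigenface subspace, dropping the first few principal
    components (which tend to capture illumination rather than identity,
    per the observation quoted above).

    train: array of shape (n_images, n_pixels).
    Returns the mean face and the retained eigenvectors (n_pixels, k)."""
    mean = train.mean(axis=0)
    centered = train - mean
    # Right singular vectors of the centered data are the eigenfaces.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[n_drop:n_drop + n_components]
    return mean, components.T

def project(image, mean, components):
    """Project a flattened face image into the truncated eigenface space."""
    return (image - mean) @ components
```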

    Arandjelovic and Cipolla [5] handle variation in illumination and pose using

    clustering and Gamma intensity correction. They create three clusters per sub-

    ject corresponding to different poses and use locations of pupils and nostrils to

    distinguish between the three clusters. Illumination is handled using Gamma in-

    tensity correction. Here, the pixels in each image are transformed so as to match

    a canonically illuminated image. Pose and illumination are combined by per-

forming PCA on variations of each person's images under different illuminations from a given person's mean image and using simple Euclidean distance as their

    distance measure. In order to match subjects to a novel image, they use the ratio

of the probability that three clusters belong to the same subject over the probability that they belong to a different subject. Their dataset consists of 20 subjects

    for training and 40 others for testing, where each subject has 20-100 images in

    random motion. They achieve 95% rank one recognition rate using this approach.
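Gamma intensity correction of the kind described above can be sketched as a simple one-dimensional search for the exponent that brings an image closest to a canonically illuminated reference; the candidate grid and the mean-squared-error criterion below are our assumptions, not necessarily those of [5].

```python
import numpy as np

def gamma_correct(image, canonical, gammas=np.linspace(0.2, 3.0, 57)):
    """Pick the gamma that makes image**gamma closest (in mean squared
    error) to a canonically illuminated, pixel-aligned reference image.
    Both images are assumed to be normalized to [0, 1]."""
    image = np.clip(image, 1e-6, 1.0)
    best_gamma = min(gammas, key=lambda g: np.mean((image ** g - canonical) ** 2))
    return image ** best_gamma, float(best_gamma)
```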

    Arandjelovic and Cipolla [4] evaluate strategies to achieve illumination in-

    variance when there are large and unpredictable illumination changes. In these

    situations, the difference between two images of the same subject under different

    illuminations is larger than that of two images under the same illumination but

of different subjects. Hence, they focus on ways to represent the subject's face and put more emphasis on the classification stage. They show that both the high-pass filter and the self-quotient image operations on the original intensity image

    show recognition improvement over the raw grayscale representation of the images,

    when the imaging conditions between the gallery and probe set are very different.

    However, they also note that while they improve recognition in the difficult cases,

    they actually reduce performance in the easy cases. They conclude that Lapla-

    cian of Gaussian representation of the image as described in [2] and a quotient

    image representation perform better than using the raw image. They demonstrate

    a rank one recognition rate improvement from about 75% using the raw images,

    to 85% using the Laplacian of Gaussian representation, to about 90%, using quo-

    tient images. Since we are dealing with conditions which change drastically and

    where the conditions for gallery and probe data differ, we use these approaches to

    improve recognition in this work.

    Gross and Brajovic [15] use an illuminance-reflectance model to generate im-

ages that are robust to illumination changes. Their model makes two assumptions: firstly, that human vision is mostly sensitive to scene reflectance and mostly insensitive to illumination conditions; and secondly, that human vision responds to local changes in contrast rather than to global brightness levels [15]. Since they focus on pre-

    processing the images based on the intensity, there is no training required. They

    test their approach using the Yale database, which contains 10 subjects acquired

    under 576 lighting conditions. When using PCA for recognition, they improve

    the rank one recognition rate from 60% to 93%, when using reflectance images in-

    stead of the original intensity images. Since we are dealing with conditions which

    change drastically and where the conditions for gallery and probe data differ, we

    use these approaches to improve recognition in this work.

    Wang et al. [46] expand on the approach in [15] and used self-quotient images

    to handle the illumination variation for face recognition. The Lambertian model

    of an image can be separated into two parts, the intrinsic and extrinsic part. If

    one can estimate the extrinsic part based on the lighting, it can be factored out

    of the image to retain the intrinsic part for face recognition. The image is found

    by using a smoothing kernel and dividing the image pixels by this filter. Let F

be the smoothing filter and I the original image; then the self-quotient image Q is defined as Q = I / F[I], where F[I] is the smoothed image. They demonstrate their approach on the Yale and PIE dataset

    and show improvement over using the intensity images for recognition, from about

    50% to about 95% rank one recognition rate.
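A minimal self-quotient image computation along these lines is sketched below; a plain Gaussian stands in for the smoothing filter F, whereas [46] uses a weighted, edge-preserving kernel, so this should be read as an approximation of the idea rather than the published method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def self_quotient_image(image, sigma=2.0, eps=1e-6):
    """Self-quotient image Q = I / F[I]: divide the face image by a smoothed
    version of itself so that slowly varying illumination is factored out
    while local (intrinsic) structure is kept."""
    image = image.astype(float)
    smoothed = gaussian_filter(image, sigma=sigma)
    q = image / (smoothed + eps)
    # Rescale to [0, 1] so the result can be fed to a standard matcher.
    q -= q.min()
    return q / (q.max() + eps)
```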

    Nishiyama et al. [25] show that self-quotient images [46] are insufficient to han-

    dle partial cast shadows or partial specular reflection. They handle this weakness

    by using an appearance-based quotient image. They use photometric linearization

    to transform the image into the diffuse reflection. A linearized image is defined as

    a linear combination of three basis images. In order to generate the basis images

    to find the diffuse image, different images from other subjects are used. They

acquire images under fixed pose with a moving light source. The reflectance image is then factored out using the estimated diffuse image. They compare their algorithm to the self-quotient image and the quotient image and show that on the Yale B database they achieve a rank one recognition rate of 96%, whereas self-quotient images achieve 87% rank one recognition rate and Support Retinex images [37] achieve a rank one recognition rate of 93%.
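The core of representing an image as a linear combination of a small number of basis images can be sketched with ordinary least squares, as below; the actual photometric linearization in [25] additionally handles shadow and specular pixels, which this fragment omits, and the toy data is purely illustrative.

```python
import numpy as np

def linearize(image, basis):
    """Approximate a flattened image (shape (n_pixels,)) as a linear
    combination of three basis images (basis: shape (n_pixels, 3)).
    Returns the coefficients and the reconstructed (diffuse) image."""
    coeffs, *_ = np.linalg.lstsq(basis, image, rcond=None)
    return coeffs, basis @ coeffs

# Toy usage with random data standing in for real face images.
rng = np.random.default_rng(0)
B = rng.random((64 * 64, 3))
img = B @ np.array([0.5, 1.2, -0.3]) + 0.01 * rng.random(64 * 64)
c, diffuse = linearize(img, B)
print(np.round(c, 2))
```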

    2.4 Other issues

    Howell and Buxton [19] propose a strategy for face recognition when using

    low-resolution video. Their goal is to capture similarities of the face over a wide

    range of conditions and solve the problem for just a small group (less than 100)

    of subjects. The environment is unconstrained in that there are no restrictions on

    movement. They use the temporal information of the frames linked by movement

    information to match the frames. This allows them to make the assumption

    that between two consecutive frames, the identity of the subject will not change

    instantly. They use a two-layer, hybrid learning network with a supervised and

    unsupervised layer and adjust weights using the Widrow-Hoff delta learning rule.

    The network is trained to include the variation that they want their system to

    tolerate. From a set of 400 images of 40 people, using 5 images per subject, and

    discarding frames that do not include a face, they are able to achieve 95% rank

    one recognition rate.

    Lee et al. [22] discuss an approach to handle low resolution video using support

    vector data description (SVDD). They project the input images as feature vectors

    on the spherical boundary of the feature space and conduct face recognition using

    correlation on the images normalized based on the inter-ocular distance. They use

the Asian Face database for their experiments and different resolutions, ranging from 16 x 16 pixels to 128 x 128 pixels, and achieve a rank one recognition rate of

    92% when using the lowest resolution images.

Lin et al. [23] describe an approach to handle face recognition from video of low resolution like that found in surveillance. They use optical flow for registration to handle issues of "non-planarity, non-rigidity, self-occlusion and illumination and reflectance variation" [23]. For each image in the sequence, they interpolate

    between the rows and columns to obtain an image that is twice the size of the

original image. They then compute optical flow between the current frame and

    the two previous and two next images and register the four adjacent images us-

    ing displacements estimated by the optical flow. Then they compute the mean

    using the registered images and the reference images. The final step is to apply a

    deblurring Wiener deconvolution filter to the super resolved image. They tested

    their approach on the CUAVE database, which contains 36 subjects. When they

    reduce the images to 13x18 pixels, their approach (approximately 15% FRR at

    1% FAR) performs slightly better than bilinear interpolated images and far out-

    performs nearest neighbor interpolation. They expand on this work in [24] and

    compare their approach to a hallucination approach (assumes a frontal view of

    face and works well when faces are aligned exactly). They conclude that while

there is some improvement gained over using the lower resolution images,

    a fully automated recognition system is currently impractical, given the perfor-

    mance. Hence, they relax their constraint to a rank ten match and can achieve

87.3% rank ten recognition rate on the XM2VTS dataset, which contains 295 subjects.
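A very crude sketch of the multi-frame idea is shown below: each frame is upsampled by a factor of two and the (assumed already registered) frames are averaged. The optical-flow registration and Wiener deblurring steps that [23] describe are omitted, so this conveys only the overall structure of the pipeline.

```python
import numpy as np

def super_resolve(frames):
    """Upsample each frame 2x (nearest neighbor via np.kron) and average the
    assumed-registered frames to reduce noise.  Registration and deblurring
    are deliberately left out of this sketch.

    frames: list of 2-D arrays of identical shape (consecutive video frames).
    """
    upsampled = [np.kron(f.astype(float), np.ones((2, 2))) for f in frames]
    return np.mean(upsampled, axis=0)
```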

    In Table 2.1, we summarize the different approaches along with their assump-

    tions, dataset size and performance. We divide up the works based on the problem

they are trying to solve: (1) variable pose, (2) variable illumination and (3) other problems, such as low resolution on the face. Performance is reported in rank one

    recognition rate, unless otherwise specified. Some of the results are reported in

    terms of equal error rate (or EER). Also, the results must be viewed in light of

    the difficulty of the dataset (data features) and dataset size.


TABLE 2.1

PREVIOUS WORK

Authors: Title | Basic idea | Data features | Dataset size | Performance

APPROACHES TO HANDLE VARIABLE POSE

Zhou: [54] | Posterior probability over time | Constrained video | 12 | 100%

Blanz and Vetter: [10] | 3D morphable model, point-to-point correspondence for similarity | Variable pose | 194 | 96%

Weyrauch, et al: [48] | 3D models based on 2D training images, render synthetic poses for recognition | 2 different illuminations, faces rotated | 6 | 90%

Park and Jain: [31] | View synthesis strategies, 3D face models using SfM | Nonfrontal faces | 197 | 70%

Beymer: [8] | Template-based approach, feature-level system combined using sum of correlations | Pose changes | 62 | 98.39%

Arandjelovic, Cipolla: [3] | Use Kernel PCA to reduce the dimensionality of images to nearly linear; apply RAD to calculate distances between two sets of images | Face movement; no illumination change | 60 | About 98%

APPROACHES TO HANDLE VARIABLE ILLUMINATION

Krueger, Zhou: [21] | Exemplar clusters to represent subjects, Bayesian probabilities over time to evaluate identity | Subject on treadmill, frontal video | 24 | 100%

Zhou, Chellappa: [52] | State of identity equation, temporal continuity | Subject on treadmill, frontal video | 25 | About 100%

Zhou, Chellappa: [53] | Likelihood probability between frames over time | Subject on treadmill, frontal video | 30 | 93%

Zhao, Chellappa: [51] | Synthetic images acquired under different lighting, use a Lambertian model to handle the albedo | Varied illumination | 15 | 100%

Wei and Lai: [47] | Modified image intensity function, normalized correlation for matching | Varying illuminations | 68 | 1.47% EER

Price and Gee: [36] | LDA-based approach using subregions of the face | Varied illumination, expression, decoration | 106 | 94.2%

Hiremath, Prabhakar: [18] | Symbolic interval-type features to represent face classes, Factor Discriminant Analysis to reduce dimensions | Different illuminations, still images | 68 | 0% EER

Belhumeur et al.: [6] | LDA for recognition, trained on multiple samples per subject with varying illumination | Variable lighting | 5 | 0.6% EER

Arandjelovic and Cipolla: [4] | Use image filters since the illumination is unpredictable | Variable lighting | 100 | 73.6%

Arandjelovic, Cipolla: [5] | Create Gaussian clusters corresponding to various poses; apply Gamma intensity correction; Euclidean distances to determine difference | Varied illumination | 40 | 95%

APPROACHES TO HANDLE LOW-RESOLUTION VIDEO

Howell, Buxton: [19] | Exploit temporal information of frames, two-layer hybrid learning network, adjusted using the Widrow-Hoff learning rule | Unconstrained environment and movement | 40 | 95%

Lee, et al: [22] | Support vector data description, correlation for recognition | Low resolution | - | 92%

Lin et al.: [23] | Use optical flow for registration, create single SR frame from 5 registered frames, use PCA with MahCosine distance for recognition | Subject talk

    -in

    g36

    15FA

    R

    27

  • 2.5 How this dissertation relates to prior work

    In this dissertation, we focus on the variations in pose and uncontrolled light-

    ing.

    To handle the variations in pose, we use a multi-gallery approach to repre-
    sent all the poses in the dataset. We create synthetic poses to represent those
    that may be present in our probe set. We then use score-level fusion. This ap-
    proach requires no training and thus is useful for datasets where the poses in the
    probe and gallery sets differ. On a dataset of 57 subjects, we achieve a rank one
    recognition rate of 21%, an improvement over the 6% rank one recognition rate

    achieved using the baseline approach. The baseline approach used for comparison

    is described in Section 4.1. This work is described in Chapter 5.
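    As an illustration of how score-level fusion over a multi-pose gallery can work, the sketch below fuses the scores of one probe frame against each subject's synthetic poses with a simple max rule (the function and subject names are hypothetical; Chapter 5 describes the fusion scheme actually used):

```python
def fuse_multi_pose_scores(scores_per_subject):
    """Score-level fusion over a multi-pose gallery using a max rule.

    scores_per_subject: dict mapping gallery subject id -> list of matching
    scores of one probe frame against that subject's synthetic poses
    (higher score = better match). Returns (best_subject, fused_score).
    """
    fused = {subject: max(scores) for subject, scores in scores_per_subject.items()}
    best_subject = max(fused, key=fused.get)
    return best_subject, fused[best_subject]

# Hypothetical example: two gallery subjects with three synthetic poses each.
scores = {"subjectA": [41.2, 55.7, 49.0], "subjectB": [38.5, 40.1, 44.3]}
print(fuse_multi_pose_scores(scores))  # ('subjectA', 55.7)
```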

    To handle the lighting conditions, we use an appearance-based model, which
    does not require any training data or knowledge of an illumination model. We create reflected

    images by using one half of the face and reflecting it over the other half. We

    then use score-level fusion to combine the two sets of results. We demonstrate

    an improvement relative to the self-quotient image and quotient image approaches,
    which assume a Lambertian model. On a dataset of 26 subjects, we show a rank

    one recognition rate of 49.88% and an equal error rate of 18.27%, whereas base-
    line performance using the original images is a 38.62% rank one recognition rate and

    19.27% equal error rate. Here, baseline performance is the performance achieved

    when using the original images as obtained from the surveillance cameras, with

    no preprocessing. This work is described in Chapter 6.
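    As a rough sketch of the reflected-image idea, assuming a geometrically normalized face whose vertical midline splits the face into two halves (illustrative only; Chapter 6 gives the exact procedure used):

```python
import numpy as np

def reflected_images(face):
    """Create two reflected faces from one normalized face image.

    face: 2D numpy array (rows x cols). Each output is built from one half
    of the face mirrored about the vertical midline onto the other half.
    """
    cols = face.shape[1]
    half = cols // 2
    left = face[:, :half]              # left half of the face
    right = face[:, cols - half:]      # right half of the face
    left_reflected = np.hstack([left, left[:, ::-1]])      # left half + its mirror
    right_reflected = np.hstack([right[:, ::-1], right])   # mirror of right half + right half
    return left_reflected, right_reflected
```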


  • CHAPTER 3

    EXPERIMENTAL SETUP

    In this chapter, we describe the sensors, data sets and software used in our

    experiments. We acquire three different datasets for our experiments. We label
    them the NDSP, IPELA and Comparison datasets. The first dataset is used to
    show baseline performance and is used in our pose and face detection experiments.

    The IPELA dataset is used for our reflection experiments to handle pose and

    illumination variation. Finally, the Comparison dataset is used to compare face

    recognition performance when using high-quality data acquired on a camcorder

    and when using data acquired on a surveillance camera.

    The rest of the chapter is organized as follows: Section 3.1 describes the differ-

    ent sensors we use in our experiments. We then describe our datasets in Section

    3.2. Finally, the software we use is described in Section 3.3.

    3.1 Sensors

    We capture data using four different sensors. The first camera is a Nikon D80,

    used to acquire the gallery data used in the experiments. The second camera is

    a PTZ camera installed by the Notre Dame Security Police. The third camera is

    a Sony IPELA camera with PTZ capability. The fourth camera is a Sony HDR

    camcorder used to capture data as a comparison to the surveillance-quality data.

    We describe each in detail below.


  • Figure 3.1. Camera to capture gallery data: Nikon D80

    3.1.1 Nikon D80

    The gallery data is acquired on a Nikon D80 [28]. It is a digital single-lens

    reflex (SLR) camera. The resolution of the images is 3872 x 2592 pixels. The

    camera is shown in Figure 3.1.

    3.1.2 Surveillance camera installed by NDSP

    The probe data is acquired using a surveillance video camera with PTZ (pan,

    tilt, zoom) capability. The camera is part of the NDSP security system and is

    attached to the ceiling on the first floor of Fitzpatrick Hall, as seen in Figure

    3.2. The resolution of this camera is 640 x 480 pixels. The data is captured in
    interlaced mode.

  • Figure 3.2. Surveillance camera: NDSP camera

    3.1.3 Sony IPELA camera

    We also acquire data on a Sony SNC-RZ25N surveillance camera [41]. The
    resolution of this camera is 640 x 480 pixels and the data

    is captured in interlaced mode. In Figure 3.3, we show an image of such a camera.

    3.1.4 Sony HDR Camcorder

    For our Comparison dataset, we also acquire high-quality data on a Sony HDR

    camcorder [42]. The video was captured at a frame rate of 29.97 frames per second

    in interlaced mode. In Figure 3.4, we show an image of this camcorder.

    In Table 3.1, we compare all the cameras used in this dissertation.


  • Figure 3.3. Surveillance camera: Sony IPELA camera

    Figure 3.4. High-definition camcorder: Sony HDR-HC7


  • TABLE 3.1

    FEATURES OF CAMERAS USED

    Features    | Camera
    Name used   | Nikon D80 | NDSP camera   | IPELA          | Sony HD
    Model       | Nikon D80 | Not available | Sony SNC-RZ25N | Sony HDR-HC7
    Resolution  | 2592x3872 | 640x480       | 640x480        | 1920x1080
    Image size  | 3,732 kb  | 40 kb         | 52 kb          | 466 kb
    Interlaced  | No        | Yes           | Yes            | Yes

    3.2 Dataset

    We describe three datasets. They are named NDSP, IPELA and Comparison

    dataset based on the camera used to acquire them and the experiments for which

    they are used.

    3.2.1 NDSP dataset

    We use two kinds of sensors to acquire data for this dataset. The gallery

    data containing high quality still images is acquired using the Nikon D80 camera.

    The subject is sitting about two meters from the camera in a controlled, well-lit
    environment, in front of a gray background. The inter-ocular distance is about
    230 pixels, ranging from 135 to 698 pixels. In Figure 3.5, we show
    the setup and two of the images acquired for the gallery.

    The probe data is acquired using the NDSP surveillance camera, located on

    the first floor of Fitzpatrick Hall. The video consists of a subject entering through
    a glass door, walking around the corner till they are out of the camera view.

  • (a) Acquisition setup for gallery data

    (b) Example gallery images

    Figure 3.5. Gallery image acquisition setup

    Each video sequence consists of between 50 and 150 frames. In Figure 3.6 we

    show 10 frames acquired from this camera. We see that the illumination is highly

    uneven due to the glare of the sun on the subject. The inter-ocular distance is

    about 40 pixels on average. The pan, tilt and zoom are not changed during data

    acquisition but could vary from day to day since the camera is part of a campus

    security system. There are 57 subjects in this dataset. The time lapse between

    the probe and gallery data varies from two weeks to about six months.

    3.2.2 IPELA dataset

    The gallery images are acquired using the Nikon D80, in a well-lit room under

    controlled conditions. The subject is sitting about two meters from the camera in

    front of a gray background, with a neutral expression, as seen in Figure 3.5. Since

    these images are acquired indoors, the illumination is controlled. The inter-ocular

    distance is about 300 pixels.

    The probe data is acquired from the IPELA camera. The zoom on this camera

    is set so that the inter-ocular distance of the subject is about 50 pixels, starting at

    about 30 pixels when the subject enters the scene and is farthest from the camera

    to about 115 pixels when the subject is closest to the camera. It is mounted on

    a tripod set at a height of about five and a half feet. The camera position is not

    changed during a day of capture, but may vary slightly from day to day. Each

    clip consists of the subject walking around a corner until they are out of the view

    of the camera. Therefore, we capture data of the subject in a variety of poses and

    face sizes. Each video sequence is made up of 100 to 200 frames. In Figure 3.7,

    we show 10 example frames acquired from one subject from one of the clips.

  • Figure 3.6: Example frames from the NDSP camera

    Figure 3.7: Example frames from the IPELA camera

    The illumination is also uncontrolled. This dataset consists of 104 subjects. The time

    lapse between the probe and gallery data is between two weeks and six months.

    Splitting up the dataset: In order to test our approach when using the

    surveillance data, we use four-fold cross validation. We split up the dataset into

    four disjoint subsets, where each set contains 26 subjects. The sets are subject-

    disjoint. For our experiments, we train the space on three subsets and test on

    the remaining subset. We use the average of the four scores as our measure of


  • performance of the different approaches.
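    A minimal sketch of such a subject-disjoint four-fold split is given below (the shuffling and subject IDs are illustrative; the actual folds used in our experiments are fixed once):

```python
import random

def four_fold_subject_splits(subject_ids, seed=0):
    """Partition subject IDs into four disjoint, equal-sized folds."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    fold_size = len(ids) // 4
    return [ids[i * fold_size:(i + 1) * fold_size] for i in range(4)]

folds = four_fold_subject_splits(range(104))  # four folds of 26 subjects each
for k, test_subjects in enumerate(folds):
    train_subjects = [s for i, fold in enumerate(folds) if i != k for s in fold]
    # ... train the subspace on train_subjects and evaluate on test_subjects;
    # the reported number is the average performance over the four test folds.
```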

    3.2.3 Comparison dataset

    For each of the subjects in this dataset, we have one high-quality still im-

    age, one high-quality video sequence and one video clip acquired from the IPELA

    surveillance camera. The gallery data is acquired in a well-lit room under con-

    trolled conditions. The subject is sitting about 2 meters from the camera against

    a gray background. We show an example image in Figure 3.5.

    The IPELA camera and HD Sony camcorder are set up to acquire data in

    the same setting. The zoom on the IPELA camera is set so that the inter-ocular

    distance of the subject is about 40 pixels on average. It is mounted on a tripod

    set at a height of about five and one half feet. The camera position is not changed

    during a day of capture, but may vary slightly from day to day. In each clip, the

    subject walks toward the left, picks up an object and then walks towards the right

    of the frame. Therefore, we capture data of the subject in a variety of poses and

    face sizes. We acquired data on three consecutive days. Each video sequence was

    made of between 100 and 300 frames. We show examples of each of these images

    in Figures 3.8 to 3.9.

    The Sony HDR camcorder is also mounted on a tripod set at a height of about

    five and a half feet and adjusted according to the height of the subject. This

    is captured simultaneously with the surveillance video, and thus consists of the

    subject walking to the left, picking up an object and then walking towards the

    right of the frame. The interocular distance of this dataset is about 45 pixels with

    a range of about 15 pixels to 110 pixels. In Figure 3.9, we show 10 example frames

    acquired from one subject from one of the clips.


  • Figure 3.8: Example frames from the IPELA camera for the Comparison dataset


  • Figure 3.9: Example frames from the Sony HDR-HC7 camcorder


  • This dataset contains 176 subjects. Out of the 176 subjects, 78 are acquired

    indoors in a hallway on the first floor of Fitzpatrick Hall. One half of the face

    is partially lit by the sun. The remaining 98 subjects are acquired outdoors

    in uncontrolled lighting conditions. We separate out these datasets to compare

    recognition performance when using data acquired indoors rather than outdoors.

    The probe and gallery data in this dataset are acquired on the same day. This

    dataset partly overlaps with the data of the Multi-Biometric Grand Challenge

    (or MBGC) dataset [29], but also includes surveillance data that is not part of

    the MBGC dataset.

    In Table 3.2, we summarize the details of the datasets we use in this disserta-

    tion.


  • TABLE 3.2

    SUMMARY OF DATASETS

    Features | Dataset

    Name | NDSP Dataset | IPELA Dataset | Comparison Dataset | Comparison Dataset

    Gallery data source | Nikon D80 | Nikon D80 | Nikon D80 | Nikon D80

    Probe data source | NDSP-installed surveillance camera | Sony IPELA camera | Sony IPELA camera | Sony HD camcorder

    Number of subjects | 57 | 104 | 176 | 176

    Number of images per gallery subject | 1 | 1 | 1 | 1

    Number of images per probe subject | 50 - 150 frames | 100 - 300 frames | 100 - 300 frames | 300 - 450 frames

    Acquisition environment of probe data | Fitzpatrick Hallway | Fitzpatrick Hallway | Indoor and outdoor | Indoor and outdoor

    Activity | Subject enters through a glass door and walks around a corner | Subject walks around a corner and down a hallway and out of view of the camera | Subject picks up an object and walks out of camera view | Subject picks up an object and walks out of camera view

    Time lapse between probe and gallery data | 2 weeks to 6 months apart | 2 weeks to 6 months apart | Same day | Same day


  • 3.3 Software

    We use a variety of software for our work: FaceGen Modeller 3.2, Viisage
    IdentityEXPLORER, Neurotechnologija, PittPatt and CSU's PCA code. They are

    described in further detail below:

    3.3.1 FaceGen Modeller 3.2

    For each gallery image, we create a 3D model using the Nikon image as input

    and then rotate the model to get different poses. In order to create the models, we

    use the FaceGen Modeller 3.2 Free Version manufactured by Singular Inversions

    [26]. The software is based on the work by Vetter et al. [10].

    This modeler creates a 3D model using the notion of an average 3D face and

    a still frontal image. It is trained on a set of subjects from various demographics

    such as age, gender and ethnicity. It requires eleven points to be marked on the

    face: centers of the eyes, edges of the nose, the corners of the mouth, the chin,

    the point at which the jaw line touches the face visually and the points at which

    the ears touch the face. Once the 3D model is rendered using the still image,

    different parameters, such as gauntness of the cheeks and the jaw line, can be tweaked

    to represent the particular subject in the 2D image more accurately. The synthetic

    3D face can then be rotated to get different views of the face. A screen shot of

    the software is shown in Figure 3.10.

  • Figure 3.10. FaceGen Modeller 3.2 Interface

    3.3.2 IdentityEXPLORER

    Viisage manufactures an SDK for multi-biometric technology, called Identity-
    EXPLORER. It provides packages for both face and fingerprint recognition. It
    is based on Viisage's Flexible Template Matching technology and a new set of
    powerful multiple biometric recognition algorithms, incorporating a unique com-

    bination of biometric tools [45]. We use it for detection and recognition:

    1. Detection: It gives the centers of the eyes and the mouth, with an associated

    confidence measure in the face localization, ranging from 0.0 to 100.0.

    2. Recognition: It takes two images and gives a matching score between the

    faces in the two images. The scores range from 0.0 to 100.0, where a higher

    score implies a better match.

    3.3.3 Neurotechnologija

    Neurotechnology [27] manufactures an SDK for face and fingerprint biometrics.

    The face recognition package is called Neurotechnologija Verilook. It includes face

    detection and face recognition capability. The face detection gives the eye and

    mouth locations. The recognition component gives the
    matching score between the faces in two images.

    3.3.4 PittPatt

    PittPatt manufactures a face detection and recognition package [35] that we

    use in our comparison experiments. The face detection component is robust to

    illumination and pose changes in the data and to a variety of demographics. Along
    with its detection capability, it can determine the pose of the face. It is able to
    detect small faces, such as faces with an inter-ocular distance of eight pixels. The

    face recognition component is also robust to a variety of poses and expressions by

    using statistical learning techniques. By combining face detection and tracking,

    PittPatt can also be used to recognize humans across video sequences.


  • 3.3.5 CSU's preprocessing and PCA software

    In order to form a template image of the face that is found in the image, we
    use CSU's preprocessing code [13]. We create images that are 65x75 pixels in
    size, based on the eye locations found by Viisage, because the subject's face in the
    surveillance video has an average inter-ocular distance of about 40 pixels. In the

    normalization stage, the images are first centered, based on eye locations, and the

    mean image of the set is subtracted from each image in the set.
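    A sketch of this kind of eye-based normalization is shown below (OpenCV is used here purely for illustration, and the placement fractions are assumptions, not the CSU defaults):

```python
import numpy as np
import cv2

def normalize_face(img, left_eye, right_eye, out_size=(65, 75),
                   eye_y_frac=0.35, eye_dist_frac=0.5):
    """Rotate, scale and crop a face so the eyes land at fixed positions.

    img: grayscale image; left_eye, right_eye: (x, y) coordinates from the
    face detector; out_size: (width, height) of the output template.
    """
    w, h = out_size
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))          # in-plane rotation of the eye line
    scale = (eye_dist_frac * w) / np.hypot(dx, dy)  # desired / measured eye distance
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # Translate so the eye midpoint lands at its canonical output position.
    M[0, 2] += w / 2.0 - center[0]
    M[1, 2] += eye_y_frac * h - center[1]
    return cv2.warpAffine(img, M, (w, h))
```

    The mean image of the training set can then be subtracted from each normalized template, as described above.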

    The CSU software also includes an implementation of Principal Component

    Analysis for face recognition [44]. We use this software when using reflectance

    images as input for recognition to handle illumination effects [15]. The basic PCA

    algorithm is described by Turk and Pentland [44]. The process consists of two
    parts, the offline training phase and the online recognition phase.

    In the offline phase, the eigenspace is created. Each image is unraveled into
    a vector and each vector becomes a column in an MxN matrix, where N is the
    number of images and M is the number of pixels. Then the covariance matrix Q
    is defined as the outer product of this matrix.

    The next step is to calculate the eigenvalues and eigenvectors of the matrix Q

    and then keep the k eigenvectors with the k largest eigenvalues (which correspond

    to the dimensions of highest variation). This defines a k-dimensional eigenspace

    into which new images can be projected. In the recognition phase, the normalized
    images are projected onto the eigenvectors, into the k-dimensional face space,

    and the projected gallery image closest to a projected probe image is the best

    match.
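    The sketch below illustrates this train/project/match pipeline with plain numpy (Euclidean distance is used for simplicity; this is not the CSU implementation, which also offers other distance measures such as MahCosine):

```python
import numpy as np

def train_pca(train_images, k):
    """Build a k-dimensional eigenspace from a list of 2D face images."""
    X = np.stack([im.ravel() for im in train_images], axis=1).astype(float)  # M pixels x N images
    mean = X.mean(axis=1, keepdims=True)
    A = X - mean
    # Eigenvectors of the covariance A A^T, obtained via the smaller N x N
    # matrix A^T A (the standard eigenface trick).
    vals, vecs = np.linalg.eigh(A.T @ A)
    top = np.argsort(vals)[::-1][:k]
    eigenfaces = A @ vecs[:, top]
    eigenfaces /= np.linalg.norm(eigenfaces, axis=0)
    return mean, eigenfaces

def project(image, mean, eigenfaces):
    """Project a face image into the k-dimensional face space."""
    return eigenfaces.T @ (image.ravel().astype(float)[:, None] - mean)

def best_match(probe, gallery, mean, eigenfaces):
    """Index of the gallery image closest to the probe in face space."""
    p = project(probe, mean, eigenfaces)
    dists = [np.linalg.norm(p - project(g, mean, eigenfaces)) for g in gallery]
    return int(np.argmin(dists))
```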


  • 3.4 Performance metrics

    In this dissertation, we use two different performance metrics to evaluate recog-

    nition performance. They are called rank one recognition rate and equal error

    rate. They are shown graphically using cumulative match characteristic (CMC) curves and receiver
    operating characteristic (ROC) curves, respectively. We describe each metric in further detail below.

    3.4.1 Rank one recognition rate

    When an image is probed against a set of gallery images, the gallery image
    that has the highest matching score to that probe image is considered its rank
    one match. The rank one recognition rate is then defined as the fraction of probe
    images whose rank one match is their true match.

    A CMC curve plots the change in recognition rate as the rank of acceptance is

    increased. The x-axis ranges from 1 through M , where M is the number of unique

    gallery subjects and the y-axis ranges from 0 to 100%. In Figure 3.11, we show

    an example of such a curve. In this example, there are 26 subjects in the gallery

    set.

    Figure 3.11. Example of CMC curve

    Assume that there are n images in the probe set and m images in the gallery set.
    Let p be the number of probe images for which the rank one match is its true
    match; then the rank one recognition rate R is defined as in Equation 3.1.

    R = (p / n) × 100 (3.1)
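    For example, given a matrix of matching scores, the rank one rate of Equation 3.1 can be computed as follows (hypothetical helper; higher scores mean better matches):

```python
import numpy as np

def rank_one_rate(scores, probe_labels, gallery_labels):
    """Percentage of probe images whose top-scoring gallery entry is the true match.

    scores: (number of probes) x (number of gallery images) matrix of matching scores.
    """
    best_labels = np.asarray(gallery_labels)[np.argmax(scores, axis=1)]
    p = np.sum(best_labels == np.asarray(probe_labels))
    return 100.0 * p / len(probe_labels)
```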

    3.4.2 Equal error rate

    Another metric is the equal error rate of the receiver operating characteristic (ROC)
    curve. An ROC curve plots the false accept rate against the true accept rate. The


    rate at which the true accept rate equals the false accept rate is called the equal

    error rate.

    An ROC curve plots the change in false accept rate versus the true accept

    rate. At each point on the graph, the threshold of acceptance as a true match is

    varied. In Figure 3.12, we show an example of such a curve.

    Figure 3.12. Example of ROC curve
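    One simple way to estimate the equal error rate from genuine (true-match) and impostor (false-match) score lists is sketched below (illustrative threshold sweep; higher scores are assumed to mean better matches):

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep an acceptance threshold and return the rate where FAR ~= FRR."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_gap, eer = None, None
    for t in np.unique(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # false accept rate at threshold t
        frr = np.mean(genuine < t)     # false reject rate = 1 - true accept rate
        if best_gap is None or abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```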

    3.5 Conclusions

    In this chapter, we discussed the sensors and datasets used in our experiments.

    We also described the software we used to support our work. Finally, we closed

    with a discussion about the metrics used to evaluate performance.


  • CHAPTER 4

    A STUDY: COMPARING RECOGNITION PERFORMANCE WHEN USING

    POOR QUALITY DATA

    The sensor used to capture data for face recognition can affect recognition

    performance. Low quality cameras are often used for surveillance, which can

    result in poor recognition because of the poor video quality and low resolution

    on the face. In this chapter, we conduct two sets of experiments. The first
    set of experiments demonstrates baseline performance using the NDSP dataset.
    This dataset is captured indoors, where the sunlight streaming through the doors
    affects the illumination of the scene. Then we show recognition experiments using
    the Comparison dataset, where video data is acquired from two different sources:

    a high-quality camcorder and a surveillance camera. We also capture data both

    indoors and outdoors to compare performance when acquiring data in different

    acquisition settings. We then compare recognition performance when using each

    of these two sources of video data as our probe set and show that performance

    falls drastically when we use poor quality video and when we move from indoor

    to outdoor settings.

    The rest of the chapter is organized as follows: First, we describe baseline

    performance for the NDSP dataset in Section 4.1. Then, Section 4.2 describes

    the experiments we run to compare performance and in Sections 4.2.2 and 4.3, we

    describe our results and conclusions.


  • 4.1 NDSP dataset: Baseline performance

    We first define baseline performance using the NDSP dataset to show the

    difficulty of this dataset. While there is significant research done in the area of

    face recognition using high - quality video where the subject is looking directly

    at the camera, research using poor - quality data with off-angle poses is also

    needed. So we define baseline performance for this dissertation to show that it is

    a challenging problem.

    4.1.1 Experiments

    For each subject in the NDSP probe set, we compare each frame of their

    probe video clip to the set of gallery images of the same subject. We describe

    how we generate the multiple gallery images per subject in Section 5.1. For each

    subject, we predetermine the best single probe video frame to use for that person.

    We do this by picking the frame that gives us the highest matching score to

    the corresponding set of gallery images. This gives us a new image set of 57

    images (one image per subject), where each image represents the highest possible

    matching score of that subject to the gallery images of the same subject. We use

    this oracle set of probe video frames as our probe set. This is an optimistic

    baseline, in that a recognition system would not necessarily be able to find the best

    frame in each probe video clip. We then run recognition using this set of images

    as probes and report the rank one recognition rate as our baseline performance.
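    The oracle selection described above can be written in a few lines (hypothetical data structures and score function; higher scores mean better matches):

```python
def build_oracle_probe_set(probe_clips, gallery_images, match_score):
    """For each subject, keep the probe frame with the highest score against
    any gallery image of that same subject.

    probe_clips, gallery_images: dicts mapping subject id -> list of images.
    match_score: function(probe_frame, gallery_image) -> matching score.
    """
    oracle = {}
    for subject, frames in probe_clips.items():
        oracle[subject] = max(
            frames,
            key=lambda frame: max(match_score(frame, g) for g in gallery_images[subject]),
        )
    return oracle
```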

    4.1.2 Results

    In Figure 4.1, we show the rank one recognition rates using this set of 57

    images, when the images in the gallery set correspond to an off-angle from
    the frontal position of 0, +/-6, +/-12 or +/-18 degrees in the yaw angle. The face is
    also rotated to +/-6 degrees in pitch angle.

  • Figure 4.1. Baseline performance for the NDSP dataset

    We see that performance steadily increases as we increase the range of poses

    available in the gallery set. We determine the best frame per subject based on its

    matching score to all 17 poses. This explains why performance peaks when we use

    all 17 poses, since we use all 17 poses to pick the frames that make up the oracle

    probe set. This shows that this is a challenging dataset, where performance is

    poor even when we pick out the probe frame with the best matching score to its


  • gallery image. Secondly, we demonstrate that using a variety of poses increases

    recognition performance.

    We show that performance increases as we increase the number of poses, till we

    stop at 17 poses. So the question arises as to whether or not performance would

    continue to increase if we were to increase the off-angle of the poses and add more
    images to our gallery. We generated additional synthetic poses, but the
    face detection system was unable to handle poses that were greater than 18 degrees from
    a frontal position. So, those poses were not used by the recognition system and,
    even if they were, would not have been useful for recognition. Furthermore, if the
    video contained images of the subject in a strictly frontal position, the additional
    poses would not be useful for recognition. However, as we showed in [43], multiple
    images can be used to improve recognition even in instances where the subject

    is in a frontal pose.

    4.2 Comparison dataset

    For comparison, we run recognition experiments using the Comparison dataset

    described in Section 3.2. This set contains high-quality still images as gallery data
    and two sets of probe data, one acquired on a high-quality camcorder and the

    other on a surveillance camera. This dataset also contains data acquired indoors

    and outdoors. This shows how the change in lighting can also affect recognition

    performance.

    4.2.1 Experiments

    For our experiments, we use PittPatt's detector and recognition system. Once

    we have detected all the faces in the probe and gallery data, we create a single


  • gallery of all the gallery images, with minimal clustering to ensure that each image

    is considered a unique subject. Then for each video sequence, we create a gallery
    and cluster it so that all the frames correspond to the same subject. We then run

    recognition of each set of videos against the gallery of high-quality still images.

    We report results using rank one recognition rate and equal error rate.

    Since we cluster the video frames to correspond to one subject, distances are

    reported between one sequence and a gallery image. So results are reported per

    video sequence, rather than per frame. Our experiments are grouped into four

    categories, depending on the sensor used and the acquisition condition in which

    the data is acquired.
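    Conceptually, reporting one result per video sequence amounts to collapsing the frame-level distances into a single sequence-level distance, for example with a min rule as sketched below (illustrative only; PittPatt's internal aggregation is not documented here):

```python
def sequence_distance(frame_distances):
    """Collapse per-frame distances to one gallery image into a single
    sequence-level distance (smaller = better). The min rule is one
    plausible choice, not necessarily the one PittPatt uses."""
    return min(frame_distances)

def rank_gallery_for_sequence(frame_distances_per_gallery):
    """frame_distances_per_gallery: dict mapping gallery image id -> list of
    per-frame distances for one probe video. Returns ids, best match first."""
    fused = {g: sequence_distance(d) for g, d in frame_distances_per_gallery.items()}
    return sorted(fused, key=fused.get)
```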

    4.2.2 Results

    In this section, we describe the detection and recognition results when we run
    the recognition experiments described in Section 4.2.

    Detection results: In Table 4.1 we show the results of the face detection

    and how many faces were detected in the video sequences. The number of faces

    detected in the outdoor video is far lower than that in the indoor video. We
    also notice that the number of faces detected decreases as we move from high-
    quality video to outdoor video. With the high-quality video indoors, detection is
    about 50% and falls to less than 5% when we move outdoors, using a surveillance
    camera. So we see that both the type of camera and the acquisition condition affect
    face detection performance.

    In Figures 4.2 through 4.5, we show an example frame from each acquisition

    and camera. We also show some of the thumbnails created after we run detection

    on the surveillance and high-definition video (both indoor and outdoor video). We


  • TABLE 4.1

    COMPARISON DATASET RESULTS: DETECTIONS IN VIDEO

    USING PITTPATT

    Performance Indoor video Outdoor video

    metric High-resolution video

    Surveillance video

    High-resolution