
  • FACE RECOGNITION FROM SURVEILLANCE-QUALITY VIDEO

    A Dissertation

    Submitted to the Graduate School

    of the University of Notre Dame

    in Partial Fulfillment of the Requirements

    for the Degree of

    Doctor of Philosophy

    by

Deborah Thomas

    Kevin W. Bowyer, Co-Director

    Patrick J. Flynn, Co-Director

    Graduate Program in Computer Science and Engineering

    Notre Dame, Indiana

    July 2010

  • © Copyright by Deborah Thomas

    2010

    All Rights Reserved

  • FACE RECOGNITION FROM SURVEILLANCE-QUALITY VIDEO

    Abstract

    by

    Deborah Thomas

In this dissertation, we develop techniques for face recognition from surveillance-quality video. We handle two specific problems that are characteristic of such video, namely uncontrolled face pose changes and poor illumination. We conduct a study that compares face recognition performance using two different types of probe data and acquiring data in two different conditions. We describe approaches to evaluate the face detections found in the video sequence to reduce the probe images to those that contain true detections. We also augment the gallery set using synthetic poses generated using 3D morphable models. We show that we can exploit temporal continuity of video data to improve the reliability of the matching scores across probe frames. Reflected images are used to handle variable illumination conditions to improve recognition over the original images. While there remains room for improvement in the area of face recognition from poor-quality video, we have shown some techniques that help performance significantly.

  • CONTENTS

    FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

    TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

CHAPTER 1: INTRODUCTION . . . 1
1.1 Description of surveillance-quality video . . . 1
1.2 Overview of our work . . . 3
1.3 Organization of the dissertation . . . 5

CHAPTER 2: PREVIOUS WORK . . . 6
2.1 Current evaluations . . . 6
2.2 Pose handling . . . 8
2.3 Illumination handling . . . 13
2.4 Other issues . . . 20
2.5 How this dissertation relates to prior work . . . 28

CHAPTER 3: EXPERIMENTAL SETUP . . . 29
3.1 Sensors . . . 29
3.1.1 Nikon D80 . . . 30
3.1.2 Surveillance camera installed by NDSP . . . 30
3.1.3 Sony IPELA camera . . . 31
3.1.4 Sony HDR Camcorder . . . 31
3.2 Dataset . . . 33
3.2.1 NDSP dataset . . . 33
3.2.2 IPELA dataset . . . 35
3.2.3 Comparison dataset . . . 38
3.3 Software . . . 43
3.3.1 FaceGen Modeller 3.2 . . . 43
3.3.2 IdentityEXPLORER . . . 43
3.3.3 Neurotechnologija . . . 45
3.3.4 PittPatt . . . 45
3.3.5 CSU's preprocessing and PCA software . . . 46
3.4 Performance metrics . . . 47
3.4.1 Rank one recognition rate . . . 47
3.4.2 Equal error rate . . . 47
3.5 Conclusions . . . 49

CHAPTER 4: A STUDY: COMPARING RECOGNITION PERFORMANCE WHEN USING POOR QUALITY DATA . . . 50
4.1 NDSP dataset: Baseline performance . . . 51
4.1.1 Experiments . . . 51
4.1.2 Results . . . 51
4.2 Comparison dataset . . . 53
4.2.1 Experiments . . . 53
4.2.2 Results . . . 54
4.3 Conclusions . . . 65

CHAPTER 5: HANDLING POSE VARIATION IN SURVEILLANCE DATA . . . 66
5.1 Pose handling: Enhanced gallery for multiple poses . . . 66
5.2 Score-level fusion for improved recognition . . . 69
5.2.1 Description of fusion techniques . . . 69
5.3 Experiments . . . 73
5.4 Results . . . 74
5.4.1 NDSP dataset . . . 74
5.4.2 IPELA dataset . . . 80
5.5 Conclusions . . . 83

CHAPTER 6: HANDLING VARIABLE ILLUMINATION IN SURVEILLANCE DATA . . . 85
6.1 Acquisition setup . . . 86
6.2 Reflecting images to handle uneven illumination . . . 86
6.2.1 Averaging images . . . 91
6.3 Comparison approaches . . . 94
6.4 Experiments . . . 96
6.4.1 Test dataset . . . 96
6.4.2 Face detection . . . 98
6.4.3 Experiments . . . 98
6.5 Results . . . 99
6.6 Conclusions . . . 101

CHAPTER 7: OTHER EXPERIMENTS . . . 103
7.1 Face detection evaluation . . . 103
7.1.1 Background subtraction . . . 105
7.1.2 Approach to pick good frames: Gestalt clusters . . . 108
7.1.3 Results: Comparing performance on entire dataset and datasets pruned using background subtraction and gestalt clustering . . . 111
7.2 Distance metrics and number of eigenvectors dropped . . . 116
7.2.1 Experiments . . . 117
7.2.2 Results . . . 117

CHAPTER 8: CONCLUSIONS . . . 119

APPENDIX A: GLOSSARY . . . 121

APPENDIX B: POSE RESULTS . . . 123

APPENDIX C: ILLUMINATION RESULTS . . . 130

BIBLIOGRAPHY . . . 135

  • FIGURES

1.1 Example showing the problem of variable illumination . . . 2

1.2 Example showing the variable pose in two frames of a video clip . . . 3

1.3 Example showing the low resolution of the face in the frame, when the subject is too far from the camera . . . 4

1.4 Example showing the face to be out of view of the camera . . . 4

3.1 Camera to capture gallery data: Nikon D80 . . . 30

3.2 Surveillance camera: NDSP camera . . . 31

3.3 Surveillance camera: Sony IPELA camera . . . 32

3.4 High-definition camcorder: Sony HDR-HC7 . . . 32

3.5 Gallery image acquisition setup . . . 34

3.6 Example frames from the NDSP camera . . . 36

3.7 Example frames from the IPELA camera . . . 37

3.8 Example frames from IPELA camcorder for the Comparison dataset . . . 39

3.9 Example frames from the Sony HDR-HC7 camcorder . . . 40

3.10 FaceGen Modeller 3.2 Interface . . . 44

3.11 Example of CMC curve . . . 48

3.12 Example of ROC curve . . . 49

4.1 Baseline performance for the NDSP dataset . . . 52

4.2 Detections on surveillance video data acquired indoors . . . 57

4.3 Detections on surveillance video data acquired outdoors . . . 58

4.4 Detections on high-definition video data acquired indoors . . . 59

4.5 Detections on high-definition video data acquired outdoors . . . 60

4.6 Results: ROC curve comparing performance when using high-definition and surveillance data (Indoor video) . . . 61

4.7 Results: ROC curve comparing performance when using high-definition and surveillance data (Outdoor video) . . . 62

4.8 Results: CMC curve comparing performance when using high-definition and surveillance data (Indoor video) . . . 63

4.9 Results: CMC curve comparing performance when using high-definition and surveillance data (Outdoor video) . . . 64

5.1 Frames showing the variable pose seen in a video clip (the black dots mark the detected eye locations) . . . 67

5.2 Synthetic gallery poses . . . 68

5.3 Change in rank matrix for a new incoming image . . . 71

5.4 Results: Comparing rank one recognition rates when adding poses of increasing degrees of off-angle poses . . . 75

5.5 Results: Comparing rank one recognition rates when using frontal, +/-6 degree and +/-24 degree poses . . . 76

5.6 Results: Comparing rank one recognition rate when using fusion techniques to improve recognition . . . 78

5.7 Examples of poorly performing images . . . 81

6.1 Setup to acquire probe data and resulting illumination variation on the face . . . 87

6.2 Comparison of gallery and probe images . . . 88

6.3 Reflection algorithm . . . 89

6.4 Example images: original image, reflected left and reflected right . . . 92

6.5 Average intensity of each column . . . 93

6.6 Reflection algorithm . . . 94

6.7 Example images: original image and averaged image . . . 95

6.8 Example images: original image and quotient image . . . 97

7.1 Example eye detections . . . 104

7.2 Structuring element used for erosion and dilation . . . 106

7.3 Example subject: Ground truth and Viisage locations . . . 109

7.4 Results: Rank one recognition rates when using the entire dataset . . . 113

7.5 Results: Rank one recognition rates when using the dataset after background subtraction . . . 114

7.6 Results: Rank one recognition rates when using the dataset after background subtraction and gestalt clustering . . . 115

B.1 CMC curves: Comparing fusion techniques approaches using a single frame . . . 124

B.2 ROC curves: Comparing fusion techniques using a single frame . . . 125

B.3 CMC curves: Comparing approaches exploiting temporal continuity, using rank-based fusion . . . 126

B.4 ROC curves: Comparing fusion techniques exploiting temporal continuity, using rank-based fusion . . . 127

B.5 CMC curves: Comparing fusion techniques exploiting temporal continuity, using score-based fusion . . . 128

B.6 ROC curves: Comparing fusion techniques approaches exploiting temporal continuity, using score-based fusion . . . 129

C.1 CMC curves: Comparing illumination approaches using a single frame . . . 131

C.2 ROC curves: Comparing illumination approaches using a single frame . . . 132

C.3 CMC curves: Comparing illumination approaches exploiting temporal continuity . . . 133

C.4 ROC curves: Comparing illumination approaches exploiting temporal continuity . . . 134

  • TABLES

2.1 PREVIOUS WORK . . . 23

3.1 FEATURES OF CAMERAS USED . . . 33

3.2 SUMMARY OF DATASETS . . . 42

4.1 COMPARISON DATASET RESULTS: DETECTIONS IN VIDEO USING PITTPATT . . . 55

4.2 COMPARISON DATASET RESULTS: COMPARISON OF RECOGNITION RESULTS ACROSS CAMERAS USING PITTPATT . . . 56

5.1 RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION . . . 79

5.2 RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION ON THE IPELA DATASET . . . 82

6.1 RESULTS: COMPARING RESULTS FOR DIFFERENT ILLUMINATION TECHNIQUES . . . 100

7.1 COMPARING DATASET SIZE OF ALL IMAGES TO BACKGROUND SUBTRACTION APPROACH AND GESTALT CLUSTERING APPROACH . . . 112

7.2 RESULTS: PERFORMANCE WHEN VARYING DISTANCE METRICS AND NUMBER OF EIGENVECTORS DROPPED . . . 118

  • CHAPTER 1

    INTRODUCTION

Face recognition from video is an important area of biometrics research today. Most of the existing work focuses on recognition from video where the images are of high resolution, containing faces in a frontal pose, and where the lighting conditions are optimal. However, face recognition from video surveillance has become an increasingly important goal as more and more video surveillance cameras are installed in public places. For example, the Metropolitan Police Department has installed 14 pan, tilt, zoom (PTZ) cameras around the Washington D.C. area [12]. Also, there are 2,397 cameras installed in Manhattan [30]. Face recognition using such video is a very challenging problem because of the low resolution and poor lighting conditions of such video and the presence of uncontrolled movement. In this dissertation, we focus on recognition in the presence of uncontrolled pose and lighting in probe data.

    1.1 Description of surveillance-quality video

We describe surveillance-quality video in terms of four characteristics: (1) variable illumination, (2) variable pose of the subjects in the video, (3) the low resolution of the faces in the video and (4) obstructions of the faces in the video.


  • Figure 1.1. Example showing the problem of variable illumination

Firstly, such video is affected by variable illumination. Oftentimes, surveillance cameras are pointed toward doorways where the sun is streaming in, or the camera may be in a poorly lit location. This can change the intensity of the image, even causing different parts of the image to be illuminated differently, which can cause problems for the recognition system. In Figure 1.1, we show an example frame of such video affected by variable illumination.

The second feature of surveillance video is the variable pose of the subject in the video. The subject is often not looking at the camera, and the camera may be mounted to the ceiling. Therefore, the subject may not be in a frontal pose in the video. While a lot of work has been done using images where the subject is looking directly at the camera, there is a need to explore recognition when the subject is not looking at the camera. In Figure 1.2, we show two such examples.

Figure 1.2. Example showing the variable pose in two frames of a video clip

Another surveillance video characteristic is the low resolution of the face. Usually, such video is of low resolution and covers a large scene. Furthermore, the camera may be located far from the subject. Hence, the subject's face may be small, causing the number of pixels on the subject's face to be low, making it difficult for robust face recognition. In Figure 1.3, we show an image where the subject is too far from the camera for reliable face recognition.

The last feature of surveillance-quality video is obstruction of the face. A perpetrator may be aware of the presence of a camera and try to cover their face to prevent it from being captured. Hats, glasses and makeup can also be used to change the appearance of the face and cause problems for recognition systems. Sometimes, the positioning of the camera may cause the face to be out of view of the camera frame, as seen in Figure 1.4.

Figure 1.3. Example showing the low resolution of the face in the frame, when the subject is too far from the camera

Figure 1.4. Example showing the face to be out of view of the camera

1.2 Overview of our work

In this dissertation, we focus on variable pose and illumination. One theme that we exploit throughout this dissertation is temporal continuity in the surveillance video. One feature of video data that still images lack is the temporal continuity between the frames of the data. The identity of the subject will not

change in an instant, so the multiple frames available can be used for recognition. The matching scores between a pair of probe and gallery subjects can be made more robust by using decisions about a previous frame for the current one. First, we compare recognition performance when using surveillance video to performance when using high-resolution video in our probe dataset. We also devise a technique to evaluate the face detections to prune the dataset to true detections, to improve recognition performance. We use a multi-gallery approach to make the recognition system more robust to variable pose in the data. We generate these poses using synthetic morphable models. We then create reflected images in order to mitigate the effects of variable illumination.

By combining these techniques, we show that we can handle some of the issues of variable pose and illumination in surveillance data and improve recognition over baseline performance.
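As a concrete illustration of the temporal-continuity idea, the following minimal Python sketch (an illustration only, not the implementation used in this dissertation; the array layout, function names, and the assumption that larger scores indicate better matches are ours) keeps a running average of the match scores against each gallery subject as successive frames of a clip arrive, so that no single poorly matched frame determines the final decision.

```python
import numpy as np

def accumulate_scores(frame_scores):
    """Combine per-frame match scores over a video clip.

    frame_scores: list of 1-D arrays, one per frame; entry g is the
    similarity between that frame and gallery subject g (higher = better).
    Returns the running average after each frame and the final best match.
    """
    running = np.zeros_like(frame_scores[0], dtype=float)
    history = []
    for t, scores in enumerate(frame_scores, start=1):
        running += (scores - running) / t   # incremental mean over frames seen so far
        history.append(running.copy())
    best_gallery_id = int(np.argmax(running))
    return history, best_gallery_id

# Toy usage: three frames scored against four gallery subjects.
frames = [np.array([0.2, 0.6, 0.1, 0.3]),
          np.array([0.3, 0.5, 0.2, 0.2]),
          np.array([0.1, 0.7, 0.2, 0.3])]
_, match = accumulate_scores(frames)
print("Predicted gallery subject:", match)
```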

    1.3 Organization of the dissertation

The rest of the dissertation is organized as follows: Chapter 2 describes previous work done in the area. In Chapter 3, we describe the sensors, dataset, and software used in our experiments. We study the effect of poor-quality video on recognition in Chapter 4. Chapters 5 through 7 describe the work we have done in this dissertation. Finally, we end with our conclusions in Chapter 8.


  • CHAPTER 2

    PREVIOUS WORK

    In this chapter, we describe previous work that looks at face recognition from

    unconstrained video. We first describe three studies that explore face recognition

    from video. We look at two problems, namely uncontrolled pose and poor lighting

    conditions. We then describe different approaches that have been used to handle

    both of these problems.

    2.1 Current evaluations

    Three different studies that address face recognition from video are: the FRVT

    2002 Evaluation, FRVT 2006 Evaluation and the Foto-Fahndung report. The

    FRVT 2002 report describes the use of three-dimensional morphable models with

    video and documents the benefits of using them for face recognition. FRVT 2006

    reports on face recognition performance under controlled and uncontrolled light-

    ing. The Foto-Fahndung report describes the face recognition performance of

    three different pieces of software, when the data comes from video acquired by

    a camera looking at an escalator in a German train station. We describe these

    studies in further detail below.

    In the FRVT 2002 Evaluation Report [32], face recognition experiments are

conducted in three new areas (three-dimensional morphable models, normalization, and face recognition from video). The first experiment compares face recognition

    performance when using still images in the probe set to using 100 frames from a

    video sequence, while the subject is talking with varied expression. The video is

    similar to that of a mugshot with the added component of change in expression.

    The gallery is a set of still images. Among all the participants in FRVT 2002,

    except for DreamMIRH and VisionSphere, recognition performance is better when

    using a still image rather than when using a video sequence. They observe that

    if the subject were walking toward the camera, there would be a change in size

    and orientation of the face that would be a further challenge to the system. In

    this work, we focus on uncontrolled video, where data is captured using a surveil-

    lance camera in uncontrolled lighting conditions, hence performance is expected

    to be poor. They also conclude that 3D morphable models provide only slight

    improvement over 2D images.

In 2006, the FRVT 2006 Evaluation Report [34] compared face recognition

    when using 2D and 3D data. It also explores face recognition when using con-

    trolled and uncontrolled lighting. When using 3D data, the algorithms were able

    to meet the FRGC [33] goal of an improvement of an order of magnitude over

    FRVT 2002. To test the effect of lighting, the gallery data was captured in a

    controlled environment, whereas the probe data was captured in an uncontrolled

    lighting environment (either indoors or outdoors). Cognitec, Neven Vision, SAIT

    and Viisage outperformed the best FRGC results achieved, with SAIT having a

    false reject rate between 0.103 and 0.130 at a false accept rate of 0.001. The per-

    formance of FRVT participants when using uncontrolled probe data matches that

    of the FRVT participants of 2002 when using controlled data. However, they also

    show that illumination condition does have a huge effect on performance.


  • The Foto-Fahndung report [9] evaluates performance of three recognition sys-

    tems when the data comes from a surveillance system in a German railway station.

    They report recognition performance in four distinct conditions based on light-

    ing and movement of the subjects and show that while face recognition systems

    can be used in search scenarios, environmental conditions such as lighting and

    quick movements influence performance greatly. They conclude that it is possible

    to recognize people from video, provided the external conditions are right, espe-

    cially lighting. They also state that high recognition performance can be achieved

    indoors, where the light does not change much. However, drastic changes in

lighting conditions affect performance greatly. They state that "High recognition performance can be expected in indoor areas which have non-varying light conditions. Varying light conditions (darkness, black light, direct sunlight) cause a sharp decrease in recognition performance. A successful utilization of biometric face recognition systems in outdoor areas does not seem to be very promising for search purposes at the moment" [9]. They suggest the use of 3D face recognition

    technology as a way to improve performance.

    2.2 Pose handling

Zhao and Chellappa [51] state that researchers have handled rotation problems in three ways: (1) using multiple images per person when they are available, (2) using multiple training images but only one database image per subject when running recognition, and (3) using a single image per subject where no training is required.

    Zhou et al. [54] apply a condensation approach to solve numerically the prob-

lem of face recognition from video. They point out that most surveillance video is of poor quality and low image resolution and has large illumination and pose

    variations. They believe that the posterior probability of the identity of a sub-

    ject varies over time. They use a condensation algorithm that determines the

    transformation of the kinematics in the sequence and the identity simultaneously,

    incorporating two conditions into their model: (1) motion in a short time interval

    depends on the previous interval, along with noise that is time-invariant and (2)

    the identity of the subject in a sequence does not change over time. When they use

    a gallery of 12 still images and 12 video sequences as probes, they achieve 100%

    rank one recognition rate. However, the small size of the dataset may contribute

    to the high accuracy.

    In a later work, they extend this approach to apply to scenarios where the

    illumination of probe videos is different from that of the gallery [21], which is also

    made up of video clips. Each subject is represented as a set of exemplars from

    a video sequence. They use a probabilistic approach to determine the set of the

    images that minimizes the expected distance to a set of exemplar clusters and

assume that in a given clip the identity of the subject does not change; Bayesian probabilities are used over time to determine the identity of the faces in the frames.

    A set of four clips of 24 subjects each walking on a treadmill is used for testing.

    The background is plain and each clip is 300 frames long. They achieve 100%

    rank one recognition rate on all four combinations of clips as probe and gallery.

    Chellappa et al. build on this in [52]. They incorporate temporal information

    in face recognition. They create a model that consists of a state equation, an

    identity equation (containing information about the temporal change of the iden-

    tity) and an observation equation. Using a set of four video clips with 25 subjects

walking on a treadmill (from the MoBo [16] database), they train their model on one or two clips per subject and use the remaining for testing. They are able

    to achieve close to 100% rank one recognition rate overall. They expand on this

    work [53] to incorporate both changes in pose within a video sequence and the

    illumination change between a gallery and probe. They combine their likelihood

    probability between frames over time which improves performance overall. In a

    set of 30 subjects, where the gallery set consists of still images, they achieved 93%

    rank one recognition rate.

    Park and Jain [31] use a view synthesis strategy for face recognition from

    surveillance video, where the poses are mainly non-frontal and the size of the faces

    is small. They use frontal pose images for their gallery, whereas the probe data

    contains variable pose. They propose a factorization method that develops 3D

    face models from 2D images using Structure from Motion (SfM). They select a

    video frame in which the pose of the face is the closest to a frontal pose, as a

    texture model for the 3D face reconstruction. They then use a gradient descent

    method to iteratively fit the 3D shape to the 72 feature points on the 2D image.

    On a set of 197 subjects, they are able to demonstrate a 40% increase in rank one

    recognition performance (from 30% to 70%).

    Blanz and Vetter [10] describe a method to fit an image to a 3D morphable

    model to handle pose changes for face recognition. Using a single image of a per-

    son, they automatically estimate 3D shape, texture and illumination. They use

    intrinsic characteristics of the face that are independent of the external conditions

    to represent each face. In order to create the 3D morphable model, they use a

    database of 3D laser scans that contains 200 subjects from a range of demograph-

    ics. They build a dense point-to-point correspondence between the face model and

a new face using optical flow. Each face is fit to the 3D shape using seven facial feature points (tip of nose, corners of eyes, etc.). They try to minimize the sum of

    squared differences over all color channels from all pixels in the test image to all

    pixels in the synthetic reconstruction. On a set of 68 subjects of the PIE database

    [40], they achieve 95% rank one recognition rate when using the side view gallery.

    Using the FERET set, with 194 subjects, they achieve 96% rank one recognition

    when using the frontal images as gallery and the remaining images as probes.

    Huang et al. [48] use 3D morphable models to handle pose and illumination

    changes in face video. They create 3D face models based on three training images

    per subject and then render 2D synthetic images to be used for face recognition.

    They apply a component-based approach for face detection that uses 14 indepen-

dent component classifiers. The faces are rotated from 0 to 34 degrees in increments of 2 degrees, using two different illuminations. At each instance, an image is saved. Out of

    the 14 components detected, nine are used for face recognition. The recognition

    system consists of second degree polynomial Support Vector Machine classifiers.

    When they use 200 images of six different subjects, they get a true accept rate of

    90% at a false accept rate of 10%.

    Beymer [8] uses a template based approach to represent subjects in the gallery

    when there are pose changes in the data. He first applies a pose estimator based

    on the features of the face (eyes and mouth). Then, using the nose and the eyes,

    the recognition system applies a transform to the input image to align the three

    feature points with a training image. When using 930 images for training the

    detector and 520 images for testing, the features are correctly detected 99.6% of

    the time. For recognition, a feature-level set of systems is used for each eye, nose

and mouth. The probe images are compared only to those gallery images closest in pose. Then he uses a sum of correlations of the best matching eye, nose and mouth templates to determine the best match. On the set of 62 subjects, when

    using 10 images per subject with an inter-ocular distance of about 60 pixels in the

    images, the rank one recognition rate is 98.39%. However, this is a relatively large

    inter-ocular distance for good face recognition and not usually typical of faces in

    surveillance quality video.

    Arandjelovic and Cipolla [3] deal with face movement and observe that most

    strategies use the temporal information of the frames to determine identity. They

    propose a strategy that uses Resistor Average Distance (RAD), which is a measure

    of dissimilarity between two disjoint probabilities. They claim that PCA does not

    capture true modes of variation well and hence a Kernel PCA is used to map the

    data to a high-dimensional space. Then, PCA can be applied to find the true

    variations in the data. For recognition, the RAD between the distributions of

    sets of gallery and probe points is used as a measure of distance. They test their

    approach on two databases. One database contains 35 subjects and the other

    contains 60 subjects. In both datasets, the illumination conditions are the same

    for training and testing. They achieve around 98% rank one recognition rate on

    the larger dataset.

    Thomas et al. [43] use synthetic poses and score-level fusion to improve recog-

    nition when there is variable pose in the data. They show that recognition can

    be improved by exploiting temporal continuity. The gallery dataset consists of

    one high-quality still image per subject. Using the approach in [10] to generate

    synthetic poses, the gallery set is enhanced with multiple images per subject. A

    dataset of 57 subjects is used, which contains subjects walking around a corner in

    a hallway. When they use the original gallery images and treat each probe image

as a single frame with no temporal continuity, they achieve a rank one recognition rate of 6%. However, by adding synthetic poses and exploiting temporal continuity, they improve rank one recognition performance to 21%.
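As a rough illustration of combining an enhanced multi-pose gallery with score-level fusion over frames (a sketch only; the fusion rules actually evaluated in [43] and in Chapter 5 may differ), the Python fragment below takes, for each frame, the best score among a subject's synthetic gallery poses and then sums these per-frame scores across the clip before deciding.

```python
import numpy as np

def fuse_multipose_scores(scores, n_poses):
    """Score-level fusion over an enhanced gallery.

    scores: array of shape (n_frames, n_subjects * n_poses); the columns for
    subject s hold match scores against each of its synthetic gallery poses
    (higher = better).  For each frame keep the best pose per subject, then
    sum over frames so that every frame of the clip contributes.
    """
    n_frames, n_cols = scores.shape
    n_subjects = n_cols // n_poses
    per_subject = scores.reshape(n_frames, n_subjects, n_poses).max(axis=2)
    fused = per_subject.sum(axis=0)          # sum-rule fusion across frames
    return int(np.argmax(fused)), fused

# Toy usage: 2 frames, 3 gallery subjects, 2 synthetic poses each.
s = np.array([[0.1, 0.4, 0.8, 0.2, 0.3, 0.1],
              [0.2, 0.3, 0.6, 0.7, 0.2, 0.2]])
best, _ = fuse_multipose_scores(s, n_poses=2)
print("best match:", best)
```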

    2.3 Illumination handling

    Zhou et al. [55] separate the strategies to handle changes in illumination into

    three categories. The first set of approaches is called subspace methods. These

    approaches are most commonly used in recognition problems. Some common ex-

    amples of this class of approaches are PCA [44] and LDA [50]. However, the

    disadvantage of such techniques is that they are tuned to the illumination condi-

    tions that they are trained on. When the gallery set consists of still images taken

    indoors under controlled lighting conditions and the probe set is of surveillance

    quality video acquired under uncontrolled lighting conditions, recognition perfor-

    mance is poor. The second set of approaches is reflectance model methods. A

    Lambertian reflectance model is used to model lighting. The disadvantage of this

    approach is that it is not as effective an approach when the subjects in the testing

    set are not encountered in the training set. The third set of approaches uses 3D

    models for representation. These models are robust to illumination effects. How-

    ever, they require a sensor that can capture such data or the data needs to be

    built based on 2D images.

    Adini et al. [2] describe four image representations that can be used to han-

    dle illumination changes. They divide the approaches to handle illumination into

    three categories: (1) Gray level information to extract a three-dimensional shape

    of the object (2) A stored model that is relatively insensitive to changes in illumi-

    nation and (3) A set of images of the same object under different illuminations.

The third approach may not be realistic given the experiment and the setup. Furthermore, one may not be able to fully capture all the possible variations in

    the data. While it has been shown theoretically that a function invariant to il-

    lumination does not exist, there are representations that are more robust than

    others [2]. The four representations they consider are (1) the original gray-level

    image, (2) the edge map of the image, (3) the image filtered with 2D Gabor like

    filters and (4) the second-order derivative of the gray level image [2]. Some edges

    of the image can be insensitive to illuminations whereas others are not. However,

    an edge map is useful in that it is a compact representation of the original im-

    age. Derivatives of the gray level image are useful because while ambient light

    will affect the gray level image, under certain conditions it does not affect the

    derivatives. In order to make the images more robust, they divide the face into

    two sub parts by creating subregions of the eyes area and the lower part of the

    face. They show that in highly variable lighting, the error rate is 100% on raw

    gray level images, where there are changes in illumination direction. Performance

    improves when using the filtered images. They also show that even though the

    filtered images do not resemble the original face, they encode information to im-

    prove recognition. However, they conclude that no one representation is sufficient

    to overcome variations in illumination. While some are robust to changes along

    the horizontal axis, others are more robust along the vertical axis. Hence, the

    different approaches need to be combined to exploit the benefits of each of them.

    Zhao and Chellappa [51] use 3D models to handle the problems of illumina-

    tion in face recognition. They create synthesized images acquired under different

    lighting and viewing conditions. They develop a 2D prototype image from a 2D

    image acquired under variable lighting using a generic 3D model, rather than a

full 3D approach that uses accurate 3D information. For the generic 3D model, a laser-scanned range map is used. They use a Lambertian model to estimate the

    albedo value, which they determine using a self-ratio image, which is the illu-

    mination ratio of two differently aligned images. Using a 3D generic model, they

    bypass the 2D to 3D step, since the pose is fixed in their dataset. When they test

    their approach using the Yale database, on a set of 15 subjects with 4 images each

they obtain 100% rank one recognition rate, an improvement of about 25% over using the original images (about 75% rank one recognition rate).

    Wei and Lai [47] describe a robust technique for face recognition under vary-

    ing lighting conditions. They use a relative image gradient feature to represent

    the image, which is the image gradient function of the original intensity image,

    where each pixel is scaled by the maximum intensity of its neighbors. They use

    a normalized correlation of the gradient maps of the probe and gallery images to

    determine how well the images match. On the CMU-PIE face database [40], which

    contains 22 images under varying illuminations of 68 individuals, they obtain an

    equal error rate of 1.47% and show that their approach outperforms recognition

    when using the original intensity images.
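A minimal sketch of such a relative-gradient representation is given below. It follows the textual description (gradient magnitude divided by the largest magnitude in a local neighborhood) rather than the exact formulation in [47]; the window size, the use of SciPy, and the normalized-correlation helper are our own choices.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def relative_gradient(image, window=3, eps=1e-6):
    """Relative image gradient: gradient magnitude at each pixel divided by
    the largest magnitude in its window x window neighborhood, which
    suppresses the influence of the local illumination level."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    local_max = maximum_filter(mag, size=window)
    return mag / (local_max + eps)

def normalized_correlation(a, b):
    """Normalized correlation between two gradient maps, usable as the
    matching score between a probe and a gallery image."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float((a * b).mean())
```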

    Price and Gee [36] also propose a PCA-based approach to address three issues

    that could cause problems in face recognition, namely illumination, expression and

    decoration (specifically, glasses and facial hair). They use an LDA-based approach

    to handle changes in illumination and expression. They note that subregions of

    the face are less sensitive to expression and decoration than the full face. So they

    break the face into modular subregions: the full face, the region of the eyes and

    the nose and then just the eyes. For each region, they independently determine

    the distance from that region to each of the corresponding images in the database.


  • Hence, they have a parallel system of observations, one for each region mentioned

    above. They then use a combination of results as their matching score to determine

    the best match. They use a database of 106 subjects with varied illumination,

    expression and decoration, where 400 still images are used for training and 276

    for testing. When they combine the results from the three observers, using PCA

    and LDA, they achieve a rank one recognition rate of 94.2%.

    Hiremath and Prabhakar [18] use interval-type discriminating features to gen-

    erate illuminant invariant images. They create symbolic faces for each subject

    in each illumination type based on the maximum and minimum value found at

    each pixel for a given dataset. While this is an appearance-based approach, it

    does not suffer the same drawbacks as other approaches because it uses interval

    type features. Therefore, it is insensitive to the particular illumination conditions

    in which the data is captured within the range of illuminations in the training

    data. They then use Factorial Discriminant Analysis to find a suitable subspace

    with optimal separation between the face classes. They test their approach using

    the CMU PIE [40] database and get a 0% error rate. This approach is advanta-

    geous in that it does not require a probability distribution of the image gradient.

    Furthermore, it does not use any complex modeling of reflection components or

    assume a Lambertian model. However, it is limited by the range of illuminations

    found in the training data. Therefore, it may not be applicable in cases where

    there is a difference in the illuminations between the gallery and probe sets.

    Belhumeur et al. [6] use LDA to produce well-separated classes for robustness

    to lighting direction and facial expression, and compare their approach to using

    eigenfaces (PCA) for recognition. They conclude that LDA performs the best

when there are variations in lighting or even simultaneous changes in lighting and expression. They also state that "In the [PCA] method, removing the first three principal components results in better performance under variable lighting conditions" [6]. Their experiments use the Harvard database [17] to test variation

    in lighting. The Harvard database contains 330 images from 5 subjects (66 images

    each). The images are divided into five subsets based on the direction of the light

    source (0, 30, 45, 60, 75 degrees). The Yale database consists of 16 subjects with

    10 images each taken on the same day but with variation in expression, eyewear

    and lighting. They use a nearest neighbor classifier for matching, though the

    measure used to determine distance was not specified. The variation in expression

    and lighting is tested using a leave-one-out error estimation strategy on all 16

subjects. They train the space on nine of the images and then test it using the

    image left out and achieve a 0.6% recognition error rate using LDA and a 19.4%

    recognition error rate using PCA, with the first three dimensions dropped. They

    do mention that the databases are small and more experimentation using larger

    databases is needed.
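The observation about dropping the first principal components is easy to express in code. The sketch below is a generic eigenface construction with the first few components removed; it is not the CSU implementation used elsewhere in this dissertation, and the parameter values are illustrative.

```python
import numpy as np

def pca_subspace(train, n_components=50, n_drop=3):
    """Build an eigenface subspace, dropping the first few principal
    components (which tend to capture illumination rather than identity,
    per the observation quoted above).

    train: array of shape (n_images, n_pixels).
    Returns the mean face and the retained eigenvectors (n_pixels, k)."""
    mean = train.mean(axis=0)
    centered = train - mean
    # Right singular vectors of the centered data are the eigenfaces.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[n_drop:n_drop + n_components]
    return mean, components.T

def project(image, mean, components):
    """Project a flattened face image into the truncated eigenface space."""
    return (image - mean) @ components
```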

    Arandjelovic and Cipolla [5] handle variation in illumination and pose using

    clustering and Gamma intensity correction. They create three clusters per sub-

    ject corresponding to different poses and use locations of pupils and nostrils to

    distinguish between the three clusters. Illumination is handled using Gamma in-

    tensity correction. Here, the pixels in each image are transformed so as to match

    a canonically illuminated image. Pose and illumination are combined by per-

forming PCA on variations of each person's images under different illuminations from a given person's mean image and using simple Euclidean distance as their

    distance measure. In order to match subjects to a novel image, they use the ratio

of the probability that three clusters belong to the same subject over the probability that they belong to a different subject. Their dataset consists of 20 subjects

    for training and 40 others for testing, where each subject has 20-100 images in

    random motion. They achieve 95% rank one recognition rate using this approach.
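Gamma intensity correction of the kind described above can be sketched as a simple one-dimensional search for the exponent that brings an image closest to a canonically illuminated reference; the candidate grid and the mean-squared-error criterion below are our assumptions, not necessarily those of [5].

```python
import numpy as np

def gamma_correct(image, canonical, gammas=np.linspace(0.2, 3.0, 57)):
    """Pick the gamma that makes image**gamma closest (in mean squared
    error) to a canonically illuminated, pixel-aligned reference image.
    Both images are assumed to be normalized to [0, 1]."""
    image = np.clip(image, 1e-6, 1.0)
    best_gamma = min(gammas, key=lambda g: np.mean((image ** g - canonical) ** 2))
    return image ** best_gamma, float(best_gamma)
```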

    Arandjelovic and Cipolla [4] evaluate strategies to achieve illumination in-

    variance when there are large and unpredictable illumination changes. In these

    situations, the difference between two images of the same subject under different

    illuminations is larger than that of two images under the same illumination but

of different subjects. Hence, they focus on ways to represent the subject's face and put more emphasis on the classification stage. They show that both the high-pass filter and the self-quotient image operations on the original intensity image

    show recognition improvement over the raw grayscale representation of the images,

    when the imaging conditions between the gallery and probe set are very different.

    However, they also note that while they improve recognition in the difficult cases,

    they actually reduce performance in the easy cases. They conclude that Lapla-

    cian of Gaussian representation of the image as described in [2] and a quotient

    image representation perform better than using the raw image. They demonstrate

    a rank one recognition rate improvement from about 75% using the raw images,

    to 85% using the Laplacian of Gaussian representation, to about 90%, using quo-

    tient images. Since we are dealing with conditions which change drastically and

    where the conditions for gallery and probe data differ, we use these approaches to

    improve recognition in this work.

    Gross and Brajovic [15] use an illuminance-reflectance model to generate im-

ages that are robust to illumination changes. Their model makes two assumptions: firstly, that human vision is mostly sensitive to scene reflectance and mostly insensitive to illumination conditions; and secondly, that human vision responds to local changes in contrast rather than to global brightness levels [15]. Since they focus on pre-

    processing the images based on the intensity, there is no training required. They

    test their approach using the Yale database, which contains 10 subjects acquired

    under 576 lighting conditions. When using PCA for recognition, they improve

    the rank one recognition rate from 60% to 93%, when using reflectance images in-

    stead of the original intensity images. Since we are dealing with conditions which

    change drastically and where the conditions for gallery and probe data differ, we

    use these approaches to improve recognition in this work.

    Wang et al. [46] expand on the approach in [15] and used self-quotient images

    to handle the illumination variation for face recognition. The Lambertian model

    of an image can be separated into two parts, the intrinsic and extrinsic part. If

    one can estimate the extrinsic part based on the lighting, it can be factored out

    of the image to retain the intrinsic part for face recognition. The image is found

    by using a smoothing kernel and dividing the image pixels by this filter. Let F

be the smoothing filter and I the original image; then the self-quotient image Q is defined as Q = I / F[I], where F[I] is the smoothed image. They demonstrate their approach on the Yale and PIE dataset

    and show improvement over using the intensity images for recognition, from about

    50% to about 95% rank one recognition rate.
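A minimal self-quotient image computation along these lines is sketched below; a plain Gaussian stands in for the smoothing filter F, whereas [46] uses a weighted, edge-preserving kernel, so this should be read as an approximation of the idea rather than the published method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def self_quotient_image(image, sigma=2.0, eps=1e-6):
    """Self-quotient image Q = I / F[I]: divide the face image by a smoothed
    version of itself so that slowly varying illumination is factored out
    while local (intrinsic) structure is kept."""
    image = image.astype(float)
    smoothed = gaussian_filter(image, sigma=sigma)
    q = image / (smoothed + eps)
    # Rescale to [0, 1] so the result can be fed to a standard matcher.
    q -= q.min()
    return q / (q.max() + eps)
```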

    Nishiyama et al. [25] show that self-quotient images [46] are insufficient to han-

    dle partial cast shadows or partial specular reflection. They handle this weakness

    by using an appearance-based quotient image. They use photometric linearization

    to transform the image into the diffuse reflection. A linearized image is defined as

    a linear combination of three basis images. In order to generate the basis images

    to find the diffuse image, different images from other subjects are used. They

acquire images under fixed pose with a moving light source. The reflectance image is then factored out using the estimated diffuse image. They compare their algorithm to the self-quotient image and the quotient image and show that on the Yale B database they achieve a rank one recognition rate of 96%, whereas self-quotient images achieve 87% rank one recognition rate and Support Retinex images [37] achieve a rank one recognition rate of 93%.
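The core of representing an image as a linear combination of a small number of basis images can be sketched with ordinary least squares, as below; the actual photometric linearization in [25] additionally handles shadow and specular pixels, which this fragment omits, and the toy data is purely illustrative.

```python
import numpy as np

def linearize(image, basis):
    """Approximate a flattened image (shape (n_pixels,)) as a linear
    combination of three basis images (basis: shape (n_pixels, 3)).
    Returns the coefficients and the reconstructed (diffuse) image."""
    coeffs, *_ = np.linalg.lstsq(basis, image, rcond=None)
    return coeffs, basis @ coeffs

# Toy usage with random data standing in for real face images.
rng = np.random.default_rng(0)
B = rng.random((64 * 64, 3))
img = B @ np.array([0.5, 1.2, -0.3]) + 0.01 * rng.random(64 * 64)
c, diffuse = linearize(img, B)
print(np.round(c, 2))
```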

    2.4 Other issues

    Howell and Buxton [19] propose a strategy for face recognition when using

    low-resolution video. Their goal is to capture similarities of the face over a wide

    range of conditions and solve the problem for just a small group (less than 100)

    of subjects. The environment is unconstrained in that there are no restrictions on

    movement. They use the temporal information of the frames linked by movement

    information to match the frames. This allows them to make the assumption

    that between two consecutive frames, the identity of the subject will not change

    instantly. They use a two-layer, hybrid learning network with a supervised and

    unsupervised layer and adjust weights using the Widrow-Hoff delta learning rule.

    The network is trained to include the variation that they want their system to

    tolerate. From a set of 400 images of 40 people, using 5 images per subject, and

    discarding frames that do not include a face, they are able to achieve 95% rank

    one recognition rate.

    Lee et al. [22] discuss an approach to handle low resolution video using support

    vector data description (SVDD). They project the input images as feature vectors

    on the spherical boundary of the feature space and conduct face recognition using

    correlation on the images normalized based on the inter-ocular distance. They use

the Asian Face database for their experiments and different resolutions, ranging from 16 x 16 pixels to 128 x 128 pixels, and achieve a rank one recognition rate of

    92% when using the lowest resolution images.

Lin et al. [23] describe an approach to handle face recognition from video of low resolution like that found in surveillance. They use optical flow for registration to handle issues of "non-planarity, non-rigidity, self-occlusion and illumination and reflectance variation" [23]. For each image in the sequence, they interpolate

    between the rows and columns to obtain an image that is twice the size of the

original image. They then compute optical flow between the current frame and

    the two previous and two next images and register the four adjacent images us-

    ing displacements estimated by the optical flow. Then they compute the mean

    using the registered images and the reference images. The final step is to apply a

    deblurring Wiener deconvolution filter to the super resolved image. They tested

    their approach on the CUAVE database, which contains 36 subjects. When they

    reduce the images to 13x18 pixels, their approach (approximately 15% FRR at

    1% FAR) performs slightly better than bilinear interpolated images and far out-

    performs nearest neighbor interpolation. They expand on this work in [24] and

    compare their approach to a hallucination approach (assumes a frontal view of

    face and works well when faces are aligned exactly). They conclude that while

there is some improvement gained over using the lower resolution images,

    a fully automated recognition system is currently impractical, given the perfor-

    mance. Hence, they relax their constraint to a rank ten match and can achieve

87.3% rank ten recognition rate on the XM2VTS dataset, which contains 295 subjects.
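A very crude sketch of the multi-frame idea is shown below: each frame is upsampled by a factor of two and the (assumed already registered) frames are averaged. The optical-flow registration and Wiener deblurring steps that [23] describe are omitted, so this conveys only the overall structure of the pipeline.

```python
import numpy as np

def super_resolve(frames):
    """Upsample each frame 2x (nearest neighbor via np.kron) and average the
    assumed-registered frames to reduce noise.  Registration and deblurring
    are deliberately left out of this sketch.

    frames: list of 2-D arrays of identical shape (consecutive video frames).
    """
    upsampled = [np.kron(f.astype(float), np.ones((2, 2))) for f in frames]
    return np.mean(upsampled, axis=0)
```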

    In Table 2.1, we summarize the different approaches along with their assump-

    tions, dataset size and performance. We divide up the works based on the problem

they are trying to solve: (1) variable pose, (2) variable illumination and (3) other problems, such as low resolution on the face. Performance is reported in rank one

    recognition rate, unless otherwise specified. Some of the results are reported in

    terms of equal error rate (or EER). Also, the results must be viewed in light of

    the difficulty of the dataset (data features) and dataset size.


TABLE 2.1

PREVIOUS WORK

Authors: Title | Basic idea | Data features | Dataset size | Performance

APPROACHES TO HANDLE VARIABLE POSE

Zhou: [54] | Posterior probability over time | Constrained video | 12 | 100%

Blanz and Vetter: [10] | 3D morphable model, point-to-point correspondence for similarity | Variable pose | 194 | 96%

Weyrauch, et al: [48] | 3D models based on 2D training images, render synthetic poses for recognition | 2 different illuminations, faces rotated | 6 | 90%

Park and Jain: [31] | View synthesis strategies, 3D face models using SfM | Nonfrontal faces | 197 | 70%

Beymer: [8] | Template-based approach, feature-level system combined using sum of correlations | Pose changes | 62 | 98.39%

Arandjelovic, Cipolla: [3] | Use Kernel PCA to reduce the dimensionality of images to nearly linear; apply RAD to calculate distances between two sets of images | Face movement; no illumination change | 60 | About 98%

APPROACHES TO HANDLE VARIABLE ILLUMINATION

Krueger, Zhou: [21] | Exemplar clusters to represent subjects, Bayesian probabilities over time to evaluate identity | Subject on treadmill, frontal video | 24 | 100%

Zhou, Chellappa: [52] | State of identity equation, temporal continuity | Subject on treadmill, frontal video | 25 | About 100%

Zhou, Chellappa: [53] | Likelihood probability between frames over time | Subject on treadmill, frontal video | 30 | 93%

Zhao, Chellappa: [51] | Synthetic images acquired under different lighting, use a Lambertian model to handle the albedo | Varied illumination | 15 | 100%

Wei and Lai: [47] | Modified image intensity function, normalized correlation for matching | Varying illuminations | 68 | 1.47% EER

Price and Gee: [36] | LDA-based approach using subregions of the face | Varied illumination, expression, decoration | 106 | 94.2%

Hiremath, Prabhakar: [18] | Symbolic interval-type features to represent face classes, Factor Discriminant Analysis to reduce dimensions | Different illuminations, still images | 68 | 0% EER

Belhumeur et al.: [6] | LDA for recognition, trained on multiple samples per subject with varying illumination | Variable lighting | 5 | 0.6% EER

Arandjelovic and Cipolla: [4] | Use image filters since the illumination is unpredictable | Variable lighting | 100 | 73.6%

Arandjelovic, Cipolla: [5] | Create Gaussian clusters corresponding to various poses; apply Gamma intensity correction; Euclidean distances to determine difference | Varied illumination | 40 | 95%

APPROACHES TO HANDLE LOW-RESOLUTION VIDEO

Howell, Buxton: [19] | Exploit temporal information of frames, two-layer hybrid learning network, adjusted using the Widrow-Hoff learning rule | Unconstrained environment and movement | 40 | 95%

Lee, et al: [22] | Support vector data description, correlation for recognition | Low resolution | - | 92%

Lin et al.: [23] | Use optical flow for registration, create single SR frame from 5 registered frames, use PCA with MahCosine distance for recognition | Subject talk

    -in

    g36

    15FA

    R

    27

  • 2.5 How this dissertation relates to prior work

    In this dissertation, we focus on the variations in pose and uncontrolled light-

    ing.

    To handle the variations in pose, we use a multi-gallery approach to repre-
    sent all the poses in the dataset. We create synthetic poses to represent those
    that may be present in our probe set. We then use score-level fusion. This ap-
    proach requires no training and thus is useful for datasets where the poses in the
    probe and gallery sets differ. On a dataset of 57 subjects, we achieve a rank one
    recognition rate of 21%, an improvement over the 6% rank one recognition rate

    achieved using the baseline approach. The baseline approach used for comparison

    is described in Section 4.1. This work is described in Chapter 5.
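    As an illustration of how score-level fusion over a multi-pose gallery can work, the sketch below fuses the scores of one probe frame against each subject's synthetic poses with a simple max rule (the function and subject names are hypothetical; Chapter 5 describes the fusion scheme actually used):

```python
def fuse_multi_pose_scores(scores_per_subject):
    """Score-level fusion over a multi-pose gallery using a max rule.

    scores_per_subject: dict mapping gallery subject id -> list of matching
    scores of one probe frame against that subject's synthetic poses
    (higher score = better match). Returns (best_subject, fused_score).
    """
    fused = {subject: max(scores) for subject, scores in scores_per_subject.items()}
    best_subject = max(fused, key=fused.get)
    return best_subject, fused[best_subject]

# Hypothetical example: two gallery subjects with three synthetic poses each.
scores = {"subjectA": [41.2, 55.7, 49.0], "subjectB": [38.5, 40.1, 44.3]}
print(fuse_multi_pose_scores(scores))  # ('subjectA', 55.7)
```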

    To handle the lighting conditions, we use an appearance-based model, which
    does not require any training data or knowledge of an illumination model. We create reflected

    images by using one half of the face and reflecting it over the other half. We

    then use score-level fusion to combine the two sets of results. We demonstrate

    an improvement relative to the self-quotient image and quotient image approaches,
    which assume a Lambertian model. On a dataset of 26 subjects, we show a rank

    one recognition rate of 49.88% and an equal error rate of 18.27%, whereas base-
    line performance using the original images is a 38.62% rank one recognition rate and

    19.27% equal error rate. Here, baseline performance is the performance achieved

    when using the original images as obtained from the surveillance cameras, with

    no preprocessing. This work is described in Chapter 6.
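    As a rough sketch of the reflected-image idea, assuming a geometrically normalized face whose vertical midline splits the face into two halves (illustrative only; Chapter 6 gives the exact procedure used):

```python
import numpy as np

def reflected_images(face):
    """Create two reflected faces from one normalized face image.

    face: 2D numpy array (rows x cols). Each output is built from one half
    of the face mirrored about the vertical midline onto the other half.
    """
    cols = face.shape[1]
    half = cols // 2
    left = face[:, :half]              # left half of the face
    right = face[:, cols - half:]      # right half of the face
    left_reflected = np.hstack([left, left[:, ::-1]])      # left half + its mirror
    right_reflected = np.hstack([right[:, ::-1], right])   # mirror of right half + right half
    return left_reflected, right_reflected
```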


  • CHAPTER 3

    EXPERIMENTAL SETUP

    In this chapter, we describe the sensors, data sets and software used in our

    experiments. We acquire three different datasets for our experiments. We label
    them the NDSP, IPELA and Comparison datasets. The first dataset is used to
    show baseline performance and is used in our pose and face detection experiments.

    The IPELA dataset is used for our reflection experiments to handle pose and

    illumination variation. Finally, the Comparison dataset is used to compare face

    recognition performance when using high-quality data acquired on a camcorder

    and when using data acquired on a surveillance camera.

    The rest of the chapter is organized as follows: Section 3.1 describes the differ-

    ent sensors we use in our experiments. We then describe our datasets in Section

    3.2. Finally, the software we use is described in Section 3.3.

    3.1 Sensors

    We capture data using four different sensors. The first camera is a Nikon D80,

    used to acquire the gallery data used in the experiments. The second camera is

    a PTZ camera installed by the Notre Dame Security Police. The third camera is

    a Sony IPELA camera with PTZ capability. The fourth camera is a Sony HDR

    camcorder used to capture data as a comparison to the surveillance-quality data.

    We describe each in detail below.


  • Figure 3.1. Camera to capture gallery data: Nikon D80

    3.1.1 Nikon D80

    The gallery data is acquired on a Nikon D80 [28]. It is a digital single-lens

    reflex (SLR) camera. The resolution of the images is 3872 x 2592 pixels. The

    camera is shown in Figure 3.1.

    3.1.2 Surveillance camera installed by NDSP

    The probe data is acquired using a surveillance video camera with PTZ (pan,

    tilt, zoom) capability. The camera is part of the NDSP security system and is

    attached to the ceiling on the first floor of Fitzpatrick Hall, as seen in Figure

    3.2. The resolution of this camera is 640 x 480 pixels. The data is captured in
    interlaced mode.

  • Figure 3.2. Surveillance camera: NDSP camera

    3.1.3 Sony IPELA camera

    We also acquire data on a Sony SNC-RZ25N surveillance camera [41]. The
    resolution of this camera is 640 x 480 pixels and the data

    is captured in interlaced mode. In Figure 3.3, we show an image of such a camera.

    3.1.4 Sony HDR Camcorder

    For our Comparison dataset, we also acquire high-quality data on a Sony HDR

    camcorder [42]. The video was captured at a frame rate of 29.97 frames per second

    in interlaced mode. In Figure 3.4, we show an image of this camcorder.

    In Table 3.1, we compare all the cameras used in this dissertation.


  • Figure 3.3. Surveillance camera: Sony IPELA camera

    Figure 3.4. High-definition camcorder: Sony HDR-HC7


  • TABLE 3.1

    FEATURES OF CAMERAS USED

    Features    | Camera
    Name used   | Nikon D80 | NDSP camera   | IPELA          | Sony HD
    Model       | Nikon D80 | Not available | Sony SNC-RZ25N | Sony HDR-HC7
    Resolution  | 2592x3872 | 640x480       | 640x480        | 1920x1080
    Image size  | 3,732 kb  | 40 kb         | 52 kb          | 466 kb
    Interlaced  | No        | Yes           | Yes            | Yes

    3.2 Dataset

    We describe three datasets. They are named NDSP, IPELA and Comparison

    dataset based on the camera used to acquire them and the experiments for which

    they are used.

    3.2.1 NDSP dataset

    We use two kinds of sensors to acquire data for this dataset. The gallery

    data containing high quality still images is acquired using the Nikon D80 camera.

    The subject is sitting about two meters from the camera in a controlled, well-lit
    environment, in front of a gray background. The inter-ocular distance is about
    230 pixels, ranging from 135 to 698 pixels. In Figure 3.5, we show
    the setup and two of the images acquired for the gallery.

    The probe data is acquired using the NDSP surveillance camera, located on

    the first floor of Fitzpatrick Hall. The video consists of a subject entering through
    a glass door, walking around the corner till they are out of the camera view.

  • (a) Acquisition setup for gallery data

    (b) Example gallery images

    Figure 3.5. Gallery image acquisition setup

    Each video sequence consists of between 50 and 150 frames. In Figure 3.6 we

    show 10 frames acquired from this camera. We see that the illumination is highly

    uneven due to the glare of the sun on the subject. The inter-ocular distance is

    about 40 pixels on average. The pan, tilt and zoom are not changed during data

    acquisition but could vary from day to day since the camera is part of a campus

    security system. There are 57 subjects in this dataset. The time lapse between

    the probe and gallery data varies from two weeks to about six months.

    3.2.2 IPELA dataset

    The gallery images are acquired using the Nikon D80, in a well-lit room under

    controlled conditions. The subject is sitting about two meters from the camera in

    front of a gray background, with a neutral expression, as seen in Figure 3.5. Since

    these images are acquired indoors, the illumination is controlled. The inter-ocular

    distance is about 300 pixels.

    The probe data is acquired from the IPELA camera. The zoom on this camera

    is set so that the inter-ocular distance of the subject is about 50 pixels, starting at

    about 30 pixels when the subject enters the scene and is farthest from the camera

    to about 115 pixels when the subject is closest to the camera. It is mounted on

    a tripod set at a height of about five and a half feet. The camera position is not

    changed during a day of capture, but may vary slightly from day to day. Each

    clip consists of the subject walking around a corner until they are out of the view

    of the camera. Therefore, we capture data of the subject in a variety of poses and

    face sizes. Each video sequence is made up of 100 to 200 frames. In Figure 3.7,

    we show 10 example frames acquired from one subject from one of the clips.

  • Figure 3.6: Example frames from the NDSP camera

    Figure 3.7: Example frames from the IPELA camera

    The illumination is also uncontrolled. This dataset consists of 104 subjects. The time

    lapse between the probe and gallery data is between two weeks and six months.

    Splitting up the dataset: In order to test our approach when using the

    surveillance data, we use four-fold cross validation. We split up the dataset into

    four disjoint subsets, where each set contains 26 subjects. The sets are subject-

    disjoint. For our experiments, we train the space on three subsets and test on

    the remaining subset. We use the average of the four scores as our measure of


  • performance of the different approaches.
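    A minimal sketch of such a subject-disjoint four-fold split is given below (the shuffling and subject IDs are illustrative; the actual folds used in our experiments are fixed once):

```python
import random

def four_fold_subject_splits(subject_ids, seed=0):
    """Partition subject IDs into four disjoint, equal-sized folds."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    fold_size = len(ids) // 4
    return [ids[i * fold_size:(i + 1) * fold_size] for i in range(4)]

folds = four_fold_subject_splits(range(104))  # four folds of 26 subjects each
for k, test_subjects in enumerate(folds):
    train_subjects = [s for i, fold in enumerate(folds) if i != k for s in fold]
    # ... train the subspace on train_subjects and evaluate on test_subjects;
    # the reported number is the average performance over the four test folds.
```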

    3.2.3 Comparison dataset

    For each of the subjects in this dataset, we have one high-quality still im-

    age, one high-quality video sequence and one video clip acquired from the IPELA

    surveillance camera. The gallery data is acquired in a well-lit room under con-

    trolled conditions. The subject is sitting about 2 meters from the camera against

    a gray background. We show an example image in Figure 3.5.

    The IPELA camera and HD Sony camcorder are set up to acquire data in

    the same setting. The zoom on the IPELA camera is set so that the inter-ocular

    distance of the subject is about 40 pixels on average. It is mounted on a tripod

    set at a height of about five and one half feet. The camera position is not changed

    during a day of capture, but may vary slightly from day to day. In each clip, the

    subject walks toward the left, picks up an object and then walks towards the right

    of the frame. Therefore, we capture data of the subject in a variety of poses and

    face sizes. We acquired data on three consecutive days. Each video sequence was

    made of between 100 and 300 frames. We show examples of each of these images

    in Figures 3.8 to 3.9.

    The Sony HDR camcorder is also mounted on a tripod set at a height of about

    five and a half feet and adjusted according to the height of the subject. This

    is captured simultaneously with the surveillance video, and thus consists of the

    subject walking to the left, picking up an object and then walking towards the

    right of the frame. The interocular distance of this dataset is about 45 pixels with

    a range of about 15 pixels to 110 pixels. In Figure 3.9, we show 10 example frames

    acquired from one subject from one of the clips.


  • Figure 3.8: Example frames from the IPELA camera for the Comparison dataset


  • Figure 3.9: Example frames from the Sony HDR-HC7 camcorder


  • This dataset contains 176 subjects. Out of the 176 subjects, 78 are acquired

    indoors in a hallway on the first floor of Fitzpatrick Hall. One half of the face

    is partially lit by the sun. The remaining 98 subjects are acquired outdoors

    in uncontrolled lighting conditions. We separate out these datasets to compare

    recognition performance when using data acquired indoors rather than outdoors.

    The probe and gallery data in this dataset are acquired on the same day. This

    dataset partly overlaps with the data of the Multi-Biometric Grand Challenge

    (or MBGC) dataset [29], but also includes surveillance data that is not part of

    the MBGC dataset.

    In Table 3.2, we summarize the details of the datasets we use in this disserta-

    tion.


  • TABLE 3.2

    SUMMARY OF DATASETS

    Features | Dataset

    Name | NDSP Dataset | IPELA Dataset | Comparison Dataset | Comparison Dataset

    Gallery data source | Nikon D80 | Nikon D80 | Nikon D80 | Nikon D80

    Probe data source | NDSP-installed surveillance camera | Sony IPELA camera | Sony IPELA camera | Sony HD camcorder

    Number of subjects | 57 | 104 | 176 | 176

    Number of images per gallery subject | 1 | 1 | 1 | 1

    Number of images per probe subject | 50 - 150 frames | 100 - 300 frames | 100 - 300 frames | 300 - 450 frames

    Acquisition environment of probe data | Fitzpatrick Hallway | Fitzpatrick Hallway | Indoor and outdoor | Indoor and outdoor

    Activity | Subject enters through a glass door and walks around a corner | Subject walks around a corner and down a hallway and out of view of the camera | Subject picks up an object and walks out of camera view | Subject picks up an object and walks out of camera view

    Time lapse between probe and gallery data | 2 weeks to 6 months apart | 2 weeks to 6 months apart | Same day | Same day


  • 3.3 Software

    We use a variety of software for our work: FaceGen Modeller 3.2, Viisage
    IdentityEXPLORER, Neurotechnologija, PittPatt and CSU's PCA code. They are

    described in further detail below:

    3.3.1 FaceGen Modeller 3.2

    For each gallery image, we create a 3D model using the Nikon image as input

    and then rotate the model to get different poses. In order to create the models, we

    use the FaceGen Modeller 3.2 Free Version manufactured by Singular Inversions

    [26]. The software is based on the work by Vetter et al. [10].

    This modeler creates a 3D model using the notion of an average 3D face and

    a still frontal image. It is trained on a set of subjects from various demographics

    such as age, gender and ethnicity. It requires eleven points to be marked on the

    face: centers of the eyes, edges of the nose, the corners of the mouth, the chin,

    the point at which the jaw line touches the face visually and the points at which

    the ears touch the face. Once the 3D model is rendered using the still image,

    different parameters, such as gauntness of the cheeks and the jaw line, can be tweaked

    to represent the particular subject in the 2D image more accurately. The synthetic

    3D face can then be rotated to get different views of the face. A screen shot of

    the software is shown in Figure 3.10.

  • Figure 3.10. FaceGen Modeller 3.2 Interface

    3.3.2 IdentityEXPLORER

    Viisage manufactures an SDK for multi-biometric technology, called Identity-
    EXPLORER. It provides packages for both face and fingerprint recognition. It
    is based on Viisage's Flexible Template Matching technology and a new set of
    powerful multiple biometric recognition algorithms, incorporating a unique com-

    bination of biometric tools [45]. We use it for detection and recognition:

    1. Detection: It gives the centers of the eyes and the mouth, with an associated

    confidence measure in the face localization, ranging from 0.0 to 100.0.

    2. Recognition: It takes two images and gives a matching score between the

    faces in the two images. The scores range from 0.0 to 100.0, where a higher

    score implies a better match.

    3.3.3 Neurotechnologija

    Neurotechnology [27] manufactures an SDK for face and fingerprint biometrics.

    The face recognition package is called Neurotechnologija Verilook. It includes face

    detection and face recognition capability. The face detection gives the eye and

    mouth locations. The recognition component gives the
    matching score between the faces in two images.

    3.3.4 PittPatt

    PittPatt manufactures a face detection and recognition package [35] that we

    use in our comparison experiments. The face detection component is robust to

    illumination and pose changes in the data and to a variety of demographics. Along
    with its detection capability, it can determine the pose of the face. It is able to
    detect small faces, such as faces with an inter-ocular distance of eight pixels. The

    face recognition component is also robust to a variety of poses and expressions by

    using statistical learning techniques. By combining face detection and tracking,

    PittPatt can also be used to recognize humans across video sequences.


  • 3.3.5 CSU's preprocessing and PCA software

    In order to form a template image of the face that is found in the image, we
    use CSU's preprocessing code [13]. We create images that are 65x75 pixels in
    size, based on the eye locations found by Viisage, because the subject's face in the
    surveillance video has an average inter-ocular distance of about 40 pixels. In the

    normalization stage, the images are first centered, based on eye locations, and the

    mean image of the set is subtracted from each image in the set.
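    A sketch of this kind of eye-based normalization is shown below (OpenCV is used here purely for illustration, and the placement fractions are assumptions, not the CSU defaults):

```python
import numpy as np
import cv2

def normalize_face(img, left_eye, right_eye, out_size=(65, 75),
                   eye_y_frac=0.35, eye_dist_frac=0.5):
    """Rotate, scale and crop a face so the eyes land at fixed positions.

    img: grayscale image; left_eye, right_eye: (x, y) coordinates from the
    face detector; out_size: (width, height) of the output template.
    """
    w, h = out_size
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))          # in-plane rotation of the eye line
    scale = (eye_dist_frac * w) / np.hypot(dx, dy)  # desired / measured eye distance
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # Translate so the eye midpoint lands at its canonical output position.
    M[0, 2] += w / 2.0 - center[0]
    M[1, 2] += eye_y_frac * h - center[1]
    return cv2.warpAffine(img, M, (w, h))
```

    The mean image of the training set can then be subtracted from each normalized template, as described above.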

    The CSU software also includes an implementation of Principal Component

    Analysis for face recognition [44]. We use this software when using reflectance

    images as input for recognition to handle illumination effects [15]. The basic PCA

    algorithm is described by Turk and Pentland [44]. The process consists of two
    parts, the offline training phase and the online recognition phase.

    In the offline phase, the eigenspace is created. Each image is unraveled into
    a vector and each vector becomes a column in an MxN matrix, where N is the
    number of images and M is the number of pixels. Then the covariance matrix Q
    is defined as the outer product of this matrix.

    The next step is to calculate the eigenvalues and eigenvectors of the matrix Q

    and then keep the k eigenvectors with the k largest eigenvalues (which correspond

    to the dimensions of highest variation). This defines a k-dimensional eigenspace

    into which new images can be projected. In the recognition phase, the normalized
    images are projected onto the eigenvectors, into the k-dimensional face space,

    and the projected gallery image closest to a projected probe image is the best

    match.
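    The sketch below illustrates this train/project/match pipeline with plain numpy (Euclidean distance is used for simplicity; this is not the CSU implementation, which also offers other distance measures such as MahCosine):

```python
import numpy as np

def train_pca(train_images, k):
    """Build a k-dimensional eigenspace from a list of 2D face images."""
    X = np.stack([im.ravel() for im in train_images], axis=1).astype(float)  # M pixels x N images
    mean = X.mean(axis=1, keepdims=True)
    A = X - mean
    # Eigenvectors of the covariance A A^T, obtained via the smaller N x N
    # matrix A^T A (the standard eigenface trick).
    vals, vecs = np.linalg.eigh(A.T @ A)
    top = np.argsort(vals)[::-1][:k]
    eigenfaces = A @ vecs[:, top]
    eigenfaces /= np.linalg.norm(eigenfaces, axis=0)
    return mean, eigenfaces

def project(image, mean, eigenfaces):
    """Project a face image into the k-dimensional face space."""
    return eigenfaces.T @ (image.ravel().astype(float)[:, None] - mean)

def best_match(probe, gallery, mean, eigenfaces):
    """Index of the gallery image closest to the probe in face space."""
    p = project(probe, mean, eigenfaces)
    dists = [np.linalg.norm(p - project(g, mean, eigenfaces)) for g in gallery]
    return int(np.argmin(dists))
```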


  • 3.4 Performance metrics

    In this dissertation, we use two different performance metrics to evaluate recog-

    nition performance. They are called rank one recognition rate and equal error

    rate. They are shown graphically using cumulative match characteristic (CMC) curves and receiver
    operating characteristic (ROC) curves, respectively. We describe each metric in further detail below.

    3.4.1 Rank one recognition rate

    When an image is probed against a set of gallery images, the gallery image
    that has the highest matching score to that probe image is considered its rank
    one match. The rank one recognition rate is then defined as the fraction of probe
    images whose rank one match is their true match.

    A CMC curve plots the change in recognition rate as the rank of acceptance is

    increased. The x-axis ranges from 1 through M , where M is the number of unique

    gallery subjects and the y-axis ranges from 0 to 100%. In Figure 3.11, we show

    an example of such a curve. In this example, there are 26 subjects in the gallery

    set.

    Figure 3.11. Example of CMC curve

    Assume that there are n images in the probe set and m images in the gallery set.
    Let p be the number of probe images for which the rank one match is its true
    match; then the rank one recognition rate R is defined as in Equation 3.1.

    R = (p / n) × 100 (3.1)
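    For example, given a matrix of matching scores, the rank one rate of Equation 3.1 can be computed as follows (hypothetical helper; higher scores mean better matches):

```python
import numpy as np

def rank_one_rate(scores, probe_labels, gallery_labels):
    """Percentage of probe images whose top-scoring gallery entry is the true match.

    scores: (number of probes) x (number of gallery images) matrix of matching scores.
    """
    best_labels = np.asarray(gallery_labels)[np.argmax(scores, axis=1)]
    p = np.sum(best_labels == np.asarray(probe_labels))
    return 100.0 * p / len(probe_labels)
```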

    3.4.2 Equal error rate

    Another metric is the equal error rate of the receiver operating characteristic (ROC)
    curve. An ROC curve plots the false accept rate against the true accept rate. The


    rate at which the true accept rate equals the false accept rate is called the equal

    error rate.

    An ROC curve plots the change in false accept rate versus the true accept

    rate. At each point on the graph, the threshold of acceptance as a true match is

    varied. In Figure 3.12, we show an example of such a curve.

    Figure 3.12. Example of ROC curve
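    One simple way to estimate the equal error rate from genuine (true-match) and impostor (false-match) score lists is sketched below (illustrative threshold sweep; higher scores are assumed to mean better matches):

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep an acceptance threshold and return the rate where FAR ~= FRR."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_gap, eer = None, None
    for t in np.unique(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # false accept rate at threshold t
        frr = np.mean(genuine < t)     # false reject rate = 1 - true accept rate
        if best_gap is None or abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```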

    3.5 Conclusions

    In this chapter, we discussed the sensors and datasets used in our experiments.

    We also described the software we used to support our work. Finally, we closed

    with a discussion about the metrics used to evaluate performance.


  • CHAPTER 4

    A STUDY: COMPARING RECOGNITION PERFORMANCE WHEN USING

    POOR QUALITY DATA

    The sensor used to capture data for face recognition can affect recognition

    performance. Low quality cameras are often used for surveillance, which can

    result in poor recognition because of the poor video quality and low resolution

    on the face. In this chapter, we conduct two sets of experiments. The first
    set of experiments demonstrates baseline performance using the NDSP dataset.
    This dataset is captured indoors, where the sunlight streaming through the doors
    affects the illumination of the scene. Then we show recognition experiments using
    the Comparison dataset, where video data is acquired from two different sources:

    a high-quality camcorder and a surveillance camera. We also capture data both

    indoors and outdoors to compare performance when acquiring data in different

    acquisition settings. We then compare recognition performance when using each

    of these two sources of video data as our probe set and show that performance

    falls drastically when we use poor quality video and when we move from indoor

    to outdoor settings.

    The rest of the chapter is organized as follows: First, we describe baseline

    performance for the NDSP dataset in Section 4.1. Then, Section 4.2 describes

    the experiments we run to compare performance and in Sections 4.2.2 and 4.3, we

    describe our results and conclusions.


  • 4.1 NDSP dataset: Baseline performance

    We first define baseline performance using the NDSP dataset to show the

    difficulty of this dataset. While there is significant research done in the area of

    face recognition using high - quality video where the subject is looking directly

    at the camera, research using poor - quality data with off-angle poses is also

    needed. So we define baseline performance for this dissertation to show that it is

    a challenging problem.

    4.1.1 Experiments

    For each subject in the NDSP probe set, we compare each frame of their

    probe video clip to the set of gallery images of the same subject. We describe

    how we generate the multiple gallery images per subject in Section 5.1. For each

    subject, we predetermine the best single probe video frame to use for that person.

    We do this by picking the frame that gives us the highest matching score to

    the corresponding set of gallery images. This gives us a new image set of 57

    images (one image per subject), where each image represents the highest possible

    matching score of that subject to the gallery images of the same subject. We use

    this oracle set of probe video frames as our probe set. This is an optimistic

    baseline, in that a recognition system would not necessarily be able to find the best

    frame in each probe video clip. We then run recognition using this set of images

    as probes and report the rank one recognition rate as our baseline performance.
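    The oracle selection described above can be written in a few lines (hypothetical data structures and score function; higher scores mean better matches):

```python
def build_oracle_probe_set(probe_clips, gallery_images, match_score):
    """For each subject, keep the probe frame with the highest score against
    any gallery image of that same subject.

    probe_clips, gallery_images: dicts mapping subject id -> list of images.
    match_score: function(probe_frame, gallery_image) -> matching score.
    """
    oracle = {}
    for subject, frames in probe_clips.items():
        oracle[subject] = max(
            frames,
            key=lambda frame: max(match_score(frame, g) for g in gallery_images[subject]),
        )
    return oracle
```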

    4.1.2 Results

    In Figure 4.1, we show the rank one recognition rates using this set of 57

    images, when the images in the gallery set correspond to an off-angle from
    the frontal position of 0, +/-6, +/-12 or +/-18 degrees in the yaw angle. The face is
    also rotated to +/-6 degrees in pitch angle.

  • Figure 4.1. Baseline performance for the NDSP dataset

    We see that performance steadily increases as we increase the range of poses

    available in the gallery set. We determine the best frame per subject based on its

    matching score to all 17 poses. This explains why performance peaks when we use

    all 17 poses, since we use all 17 poses to pick the frames that make up the oracle

    probe set. This shows that this is a challenging dataset, where performance is

    poor even when we pick out the probe frame with the best matching score to its


  • gallery image. Secondly, we demonstrate that using a variety of poses increases

    recognition performance.

    We show that performance increases as we increase the number of poses, till we

    stop at 17 poses. So the question arises as to whether or not performance would

    continue to increase if we were to increase the off-angle of the poses and add more
    images to our gallery. We generated additional synthetic poses, but the
    face detection system was unable to handle poses that were greater than 18 degrees from
    a frontal position. So, those poses were not used by the recognition system and,
    even if they were, would not have been useful for recognition. Furthermore, if the
    video contained images of the subject in a strictly frontal position, the additional
    poses would not be useful for recognition. However, as we showed in [43], multiple
    images can be used to improve recognition even in instances where the subject

    is in a frontal pose.

    4.2 Comparison dataset

    For comparison, we run recognition experiments using the Comparison dataset

    described in Section 3.2. This set contains high-quality still images as gallery data
    and two sets of probe data, one acquired on a high-quality camcorder and the

    other on a surveillance camera. This dataset also contains data acquired indoors

    and outdoors. This shows how the change in lighting can also affect recognition

    performance.

    4.2.1 Experiments

    For our experiments, we use PittPatt's detector and recognition system. Once

    we have detected all the faces in the probe and gallery data, we create a single


  • gallery of all the gallery images, with minimal clustering to ensure that each image

    is considered a unique subject. Then for each video sequence, we create a gallery
    and cluster it so that all the frames correspond to the same subject. We then run

    recognition of each set of videos against the gallery of high-quality still images.

    We report results using rank one recognition rate and equal error rate.

    Since we cluster the video frames to correspond to one subject, distances are

    reported between one sequence and a gallery image. So results are reported per

    video sequence, rather than per frame. Our experiments are grouped into four

    categories, depending on the sensor used and the acquisition condition in which

    the data is acquired.
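    Conceptually, reporting one result per video sequence amounts to collapsing the frame-level distances into a single sequence-level distance, for example with a min rule as sketched below (illustrative only; PittPatt's internal aggregation is not documented here):

```python
def sequence_distance(frame_distances):
    """Collapse per-frame distances to one gallery image into a single
    sequence-level distance (smaller = better). The min rule is one
    plausible choice, not necessarily the one PittPatt uses."""
    return min(frame_distances)

def rank_gallery_for_sequence(frame_distances_per_gallery):
    """frame_distances_per_gallery: dict mapping gallery image id -> list of
    per-frame distances for one probe video. Returns ids, best match first."""
    fused = {g: sequence_distance(d) for g, d in frame_distances_per_gallery.items()}
    return sorted(fused, key=fused.get)
```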

    4.2.2 Results

    In this section, we describe the detection and recognition results when we run
    the recognition experiments described in Section 4.2.

    Detection results: In Table 4.1 we show the results of the face detection

    and how many faces were detected in the video sequences. The number of faces

    detected in the outdoor video is far lower than that in the indoor video. We
    also notice that the number of faces detected decreases as we move from high-
    quality video to outdoor video. With the high-quality video indoors, detection is
    about 50% and falls to less than 5% when we move outdoors, using a surveillance
    camera. So we see that both the type of camera and the acquisition condition affect
    face detection performance.

    In Figures 4.2 through 4.5, we show an example frame from each acquisition

    and camera. We also show some of the thumbnails created after we run detection

    on the surveillance and high-definition video (both indoor and outdoor video). We


  • TABLE 4.1

    COMPARISON DATASET RESULTS: DETECTIONS IN VIDEO

    USING PITTPATT

    Performance Indoor video Outdoor video

    metric High-resolution video

    Surveillance video

    High-resolution