
Real-time Face Detection and Motion Analysis with Application in “Liveness” Assessment

Klaus Kollreider*, Hartwig Fronthaler, Maycel Isaac Faraj and Josef Bigun, Fellow, IEEE

Abstract—A robust face detection technique along with mouth localization, processing every frame in real time (video rate), is presented. Moreover, it is exploited for motion analysis on-site, to verify “liveness” as well as to achieve lip-reading of digits. A methodological novelty is the suggested quantized angle features (“quangles”), designed for illumination invariance without the need for pre-processing, e.g. histogram equalization. This is achieved by using both the gradient direction and the double angle direction (the structure tensor angle), and by ignoring the magnitude of the gradient. Boosting techniques are applied in a quantized feature space. A major benefit is reduced processing time, i.e. the training of effective cascaded classifiers is feasible in very short time, less than 1 hour for data sets of order 10^4. Scale invariance is implemented through the use of an image scale-pyramid. We propose “liveness” verification barriers as applications, for which a significant amount of computation is avoided when estimating motion. Novel strategies to avert advanced spoofing attempts, e.g. replayed videos including person utterances, are demonstrated. We present favorable results on face detection for the YALE face test set and competitive results for the CMU-MIT frontal face test set, as well as on “liveness” verification barriers.

Index Terms—Object detection, face detection, landmark detection, quantized angles, real-time processing, Optical Flow of Lines, liveness, anti-spoofing, lip reading, AdaBoost, SVM

EDICS Category: BIO-ATTA

I. INTRODUCTION

IN face analysis related to biometrics, one may distinguish between face detection and face recognition. The former deals with locating one or multiple instances of human faces in a photograph or video, whereas the latter aims at establishing a unique link between two or more (independent) face recordings of the same person. It is worth noting that face detection is a prerequisite to both sub-categories of face recognition: authentication (1:1 matching) and identification (1:n matching). This is because an accurate registration between a pair of face images, and/or between the landmarks (e.g. eyes and mouth) of the face images, plays a pivotal role on the way toward good recognition performance. Realistic applications harden the task further. These considerations also imply that any image region deemed a face (even a non-face) will be considered, ignoring the fact that the identity establishment can be compromised well before authentication/identification, because non-faces are given a real chance to impersonate. Additionally, material for counterfeiting any biometric is easy to acquire (e.g. photographs from a website, or traces of fingerprints left on a glass), and thus raising anti-spoofing barriers is indispensable at the very start of biometric person recognition. Methods bringing benefits to both face and “liveness” detection are desirable, which is also the application focus of this paper.

K. Kollreider, H. Fronthaler, M. Faraj and J. Bigun are with Halmstad University, Box 823, SE-30118, Halmstad, Sweden. E-mail: {klaus.kollreider, hartwig.fronthaler, maycel.faraj, josef.bigun}@ide.hh.se.

A. Face/Landmark Detection

When attempting to detect faces (or locate a single face) in a visual representation, one may primarily distinguish between image-based and landmark-based methods [1], [2]. This paper focuses on the detection of frontal faces in 2D images, although we will be using a sequence of them to assess 3D information for “liveness” assessment purposes. Features here represent measurements made by means of some basis functions in a multidimensional space, which should be contrasted to the term “facial features” sometimes used in published studies to name subparts of a face, e.g. the eyes, mouth, etc. We will call the latter “facial landmarks” or “landmarks”. Challenges in face detection generally comprise varying illumination, expression changes, (partial) occlusion, pose extremities [1], and requirements on real-time computation.

The main characteristic of still image-based methods is that they process faces in a holistic manner. That is, no parts of the face are intentionally favored for face detection, or when a favoring is undertaken, the selection of face parts is left to the training/classifier. Faces are learned by training on roughly aligned portraits as well as non-face-like images. The training is typically part of a high level statistical pattern recognition method, e.g. a Bayesian classifier, an artificial neural network, or an SVM. When operational, the trained classifier provides a decision on face or non-face at sub-parts of an image and at various scales. A popular approach, although primarily designed for face recognition, uses the so-called Eigenfaces [3], or the PCA (Principal Component Analysis) coordinates, to quantify the “faceness” of an image (region). More recent face detection systems include [4], [5], which use neural networks, whereas Support Vector Machines (SVM) [6] were employed in [7] to classify image regions as face or non-face. A naive Bayes scheme was implemented in [8], and recently in [9], whereas an AdaBoost procedure [10] was concurrently adapted in [11], [12] for the purpose of classifying regions with respect to “faceness”. Image-based methods can easily be trained in an analogous manner to detect other objects in images, e.g. cars. The AdaBoost-based face detection in [11], [12] has been suggested as being real-time, and has been followed up by other studies extending it to multiple poses and reducing classifier complexity, e.g. [13], [14].


However, the employed features play a decisive role besides the classifiers used. Of all the published methods, only a few use the gray values directly as features to be classified. However, almost all approaches use a preprocessing of the gray values (e.g. histogram equalization or normalization) to minimize the effect of adverse light conditions, at the expense of additional computation. The methods suggested by [11], [13], [14] use Haar-like rectangle features, translating into a high detection speed, whereas [9], [12] employed edge features with arguably lower execution speed. As will be detailed further below, some of the advantages of those methods can also be identified in the scheme we suggest here.

A novelty in this study is the use of gradient angles only, making gray value preprocessing expendable. Also, the use of hierarchical and adaptive quantization levels improves the detection performance, as opposed to a few, fixed angle quantizations, e.g. 7 in [9]. Another contribution is that not only the gradient direction information is used, but in addition the structure tensor direction [15] is exploited to encode the local structure. Because we use quantized angle features, we term the latter “quangles” for expediency. Furthermore, these quangles are boosted in layers of a decision cascade as in [11], enabling small classifiers. Separable filtering and the use of lookup tables account for the remaining speedup, resulting in efficient computations when the system is operational as well as off-line when it is used for training. We achieve scale invariance through signal-theoretically correct downsampling in a pyramidal scheme. The usefulness of the method is shown in the context of face and landmark (mouth) detection. A methodological advantage of the suggested scheme is the ready availability of some filtered signals that are also highly desirable for other tasks, for example real-time optical flow calculations when implementing “liveness” [16] verification barriers in biometric person authentication. In comparison, the rectangle features suggested in [11], despite their value in pure object detection in still images, have limited reusability when it comes to motion tasks, e.g. motion quantification.

To have a fuller picture, we also summarize the basic rationale of landmark-based methods. Often citing biological motivations, they focus on a few salient face parts, landmarks, at which most of the processing is concentrated. Such landmarks are, for example, the single eyes, mouth, nose (nostrils), eyebrows, etc. Many ideas have been published on how to extract these, e.g. using Gabor features and SVMs, which may include techniques referred to in the former category, but employed in a local window. During training, face parts are primarily learned, as opposed to the whole face (holistic). All candidate sites are examined according to local models but in a global scope by means of a (global) shape model. However, if more than one face is present in an image, the complexity quickly increases, and landmark-based methods encounter computational problems when real-time performance is demanded. With currently available technologies, their use is hampered in such applications. Nevertheless, they are considered potentially more robust to partial occlusion and pose changes of a face, as well as being more precise in localization.

A survey of face detection methods using landmarks is given in [1], [17].

B. “Liveness” Verification

“Liveness” verification studies in connection with biometric person recognition are few in the literature, especially for face biometrics. Also, commercial face recognition systems lack a comprehensive strategy towards the problem of “liveness”. Intuitively, one could use a multi-modal system, e.g. face tracking and fingerprint [18], with numerous sensors (e.g. several cameras, heat-sensitive cameras, etc.) to ease the task. Such configurations might not be applicable for a variety of reasons, e.g. due to being perceived as intrusive or costly, besides being complex. One can instead exploit the additional “liveness” information hidden in an image sequence (video). As an effort in this direction, methods exploiting changes in face expression [19] and landmark trajectories [16], [20] have been suggested. Although these methods are weak against video replay attacks, they can nevertheless be used as barriers against less intricate attacks, e.g. those using photographs. Nowadays, an attacker could benefit from advancing portable digital video player technologies. Such a device positioned well enough in front of an unsophisticated camera system could pose a potential threat to break into a biometric recognition system. However, it is likely that an advanced “liveness” verification barrier would contain a reliable face detector and tracker as well as local motion (optical flow) estimators. Presented photographs can be rejected in the way shown in [16] by using on-site motion information. The main idea in this barrier is to accumulate evidence for the three-dimensionality of the (somehow) detected face(s) by employing the OFL (Optical Flow of Lines) to conclude on the 3D structure. Here, in contrast to [19], face expression changes are helpful yet not required, because any head movements will suffice. In this study, we propose another barrier that uses lip movement classification and lip-reading for the purpose of “liveness” detection in a text-prompted dialogue scenario. This means that the system will expect the person to utter something, e.g. a prompted random sequence of digits, and it should verify whether the observed lip dynamics fit in. For so-called “talking face” systems, this has also been addressed, most recently in [21], by inspecting the correlation of audio and video channels. The novelty in our study is that we suggest a method for real-time assessment of lip motion to recognize digits without audio information. A simple analysis of the mouth regions was done in [21], whereas [9] focused on four corner points of the mouth/lips. On the other hand, tracking the lip contours as, for example, used in audio-visual recognition [22], is susceptible to noise, often requiring human intervention. We propose to automatically locate the mouth region and extract the enclosed OFL in real-time. This implies that we model the motion of lips by moving line patterns in space-time planes [23]. We assume a digit-prompted scenario using an SVM expert to classify the lip dynamics. Lip reading by motion analysis has also been shown useful for person authentication [23], [24]. The scheme allows a person to just whisper or to mime the digits, thereby counteracting eavesdropping.


We present experimental results on several public databases: the pure face detection is assessed on the CMU-MIT [5] and the YALE [25] face test sets, both being representative of real-world (and extreme light) conditions and also used by other studies to this end. We also provide experimental considerations concerning computational complexity, contrasting our method to Viola & Jones’ approach [11] using OpenCV (the Open Computer Vision library). The “liveness” experiments are conducted on the XM2VTS database [26], simulating a digit-prompted scenario in which the combined face and mouth detection is employed for autonomous processing.

II. OBJECT/FACE DETECTION

A. The Quantized Angle Features (Quangles)

In this section we present the features for object/face detection, which we call “quangles”, representing quantized angle features. We exploit parts of the gradient information in order to determine the presence of a certain object. The gradient of an image is given in equation (1),

$$\nabla f = \begin{pmatrix} f_x \\ f_y \end{pmatrix} \qquad (1)$$

where f_x and f_y denote the derivatives in the x and y directions, respectively. Furthermore, |∇f| indicates the magnitude of the gradient and ∠∇f refers to its angle. For the sake of object detection, we disregard the magnitude, or intensity, since it is highly affected by undesired external influences like illumination variations (noise).

The key instrument of our quangle features is the quangle mask, which is defined as follows:

$$Q(\tau_1, \tau_2, \phi) = \begin{cases} 1, & \text{if } \tau_1 < \phi < \tau_2, \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

The thresholds τ1 and τ2 constitute the boundaries of a partition in [0, 2π]. The quangle mask yields 1 if an angle φ is located within such a partition, and 0 otherwise. In order to produce a set of quangle masks, we divide the full angle range [0, 2π] into an increasing number of quantizations (partitions), which are additionally rotated. An example is depicted in figure 1. A set of quangle masks {Q}_{maxQuant,nRot} is fully determined by the maximum number of quantizations maxQuant and the number of rotations nRot. The parameter maxQuant has to be interpreted cumulatively, meaning that all quangle masks with fewer quantization steps are included in the set as well. The second parameter, nRot, indicates the number of rotations applied to each basic quangle mask. The final row in figure 1, for example, corresponds to {Q}_{4,2}, which consists of 27 different quangle masks. In order to create such a set of quangle masks, the thresholds τ1 and τ2 of each partition need to be determined. This can be done in a three-step procedure (a code sketch follows the list below):

1) First we define a sequence of threshold pairs α1 and α2, delimiting the desired number of partitions nQuant in the interval [0, 2π] while disregarding the rotational component:

$$\alpha_1 = \frac{2\pi}{nQuant}\cdot quant, \qquad \alpha_2 = \frac{2\pi}{nQuant}\cdot(quant+1), \qquad (3)$$

where quant ∈ {0, ..., nQuant − 1}.

Fig. 1. Example of a set of quangle masks (angle displayed in polar form) using up to 4 quantizations and 2 rotations. The quangle masks defining larger partitions are automatically included in the set. The gray shaded areas correspond to the partitions yielding value 1 in equation (2).

2) In the second step we create the final threshold sequence containing pairs of τ1 and τ2. For each partition quant we include nRot rotated versions:

$$\tau_k = \mathrm{mod}\!\left(\alpha_k - \frac{\pi}{nQuant}\cdot rot,\; 2\pi\right), \qquad (4)$$

where rot ∈ {0, ..., nRot} and k ∈ {1, 2}.

3) Performing the first two steps corresponds to creating a single cell in figure 1. In order to produce a complete quangle set, the two steps above need to be repeated for nQuant = {2, ..., maxQuant}.
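For concreteness, the three-step procedure can be written down as a short Python sketch. The parameter names follow the text (maxQuant, nRot), whereas the wrap-around handling for rotated partitions is our assumption, since equation (2) only states the in-between test.

```python
import numpy as np

def quangle_mask_set(max_quant, n_rot):
    """Threshold pairs (tau1, tau2) following equations (3) and (4).

    All coarser quantizations are included as well (max_quant is cumulative),
    and every partition appears in n_rot additional rotated versions.
    """
    masks = []
    for n_quant in range(2, max_quant + 1):
        for quant in range(n_quant):
            # Equation (3): partition boundaries without rotation.
            alpha1 = 2.0 * np.pi / n_quant * quant
            alpha2 = 2.0 * np.pi / n_quant * (quant + 1)
            for rot in range(n_rot + 1):
                # Equation (4): rotate the partition and wrap into [0, 2*pi).
                tau1 = np.mod(alpha1 - np.pi / n_quant * rot, 2.0 * np.pi)
                tau2 = np.mod(alpha2 - np.pi / n_quant * rot, 2.0 * np.pi)
                masks.append((tau1, tau2))
    return masks

def quangle_mask(tau1, tau2, phi):
    """Equation (2): 1 if phi lies inside the partition, 0 otherwise.

    If the rotated partition wraps past 2*pi (tau1 > tau2), membership is
    tested on the wrapped interval; this handling is an assumption.
    """
    if tau1 < tau2:
        return int(tau1 < phi < tau2)
    return int(phi > tau1 or phi < tau2)

# Sanity check against the text: {Q}_{4,2} should contain 27 masks.
assert len(quangle_mask_set(4, 2)) == 27
```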

To detect objects at a single scale we use a sliding window approach, where an image is scanned by a so-called search or detection window. Scale invariance is achieved by successively downsizing the original image. In order to look for candidates, these quangle masks need to be assigned to positions (i, j) within the detection window x. This defines at the same time our quangle features. We furthermore distinguish between two different types. Equation (5a) describes a quangle feature using the original gradient angle, whereas the double angle representation is employed in equation (5b).

$$q_1(x, i, j, \tau_1, \tau_2) = Q\left(\tau_1, \tau_2,\; \angle\nabla x(i, j)\right) \qquad (5a)$$

$$q_2(x, i, j, \tau_1, \tau_2) = Q\left(\tau_1, \tau_2,\; \mathrm{mod}(2\cdot\angle\nabla x(i, j),\, 2\pi)\right) \qquad (5b)$$

Both quangle feature types in the equations above take the detection window x, the position (i, j) within x, and a particular quangle mask out of {Q}. Using both q1 and q2, the number of possible features is determined by the size of the detection window and the number of quangle masks in {Q}. We include both single and double angle representations in our set of quangle features since they are meaningful at different sites within the search window. Considering a face, the original gradient is more informative between the landmarks, because this information can effectively differentiate between dark-light and light-dark transitions, which is lost when doubling the angle, as discussed next.


The double angle representation, which maps φ to 2φ, has been shown to represent the structure tensor eigenvector directions. Despite being a simple linear mapping, the 2φ mapping does not carry the same information as φ, because φ is an angle and the 2φ mapping is on the ring [0, 2π] (not on the line [−∞, ∞]); this creates an equivalence class [15] for angles differing by π. Thus ∇f and −∇f are equivalent after the 2φ mapping, which more effectively represents linear structures that are invariant to a rotation by π. Such structures include, but are not limited to, lines. The double angle representation at boundaries is thus more resistant to illumination and background changes, because it can represent the symmetry inherent to such points more effectively. Accordingly, single angle and double angle features are complementary and meaningful features for representing objects/faces.

B. Classifier Building

A good classification (yielding a low error rate) cannot be obtained with a single quangle feature, but obviously it is neither meaningful nor practical to evaluate all of them within the detection window. In order to find the most suitable quangles we employ AdaBoost [10], which provides a very efficient feature selection procedure. In the process, a number of good features (termed weak classifiers) are combined, yielding a so-called strong classifier. In the following we provide a brief step-by-step description of how such a strong classifier can be built using AdaBoost learning (a code sketch of the loop is given after the list):

1) x_i denotes the training images and y_i ∈ {0, 1} labels them as positive or negative class examples.

2) Weight initialization: w_{1,i} = 1/(2m) or 1/(2l), where m and l are the numbers of positive and negative class examples, respectively.

3) Weight normalization: $w_{t,i} \leftarrow \frac{w_{t,i}}{\sum_{j=1}^{n} w_{t,j}}$, where n denotes the total number of training images and t (initially 1) is the weak classifier index.

4) Weighted error calculation: $e_t = \min_{i,j,\tau_1,\tau_2,k} \sum_i w_i \cdot \left| q_k(x, i, j, \tau_1, \tau_2) - y_i \right|$, where k ∈ {1, 2} represents the feature type (single or double angle).

5) Select the weak classifier with the lowest error: $h_t(x) = q_{k_t}(x, i_t, j_t, \tau_{1_t}, \tau_{2_t})$, where the parameters $i_t$, $j_t$, $\tau_{1_t}$, $\tau_{2_t}$ and $k_t$ minimize the error.

6) Weight update: $w_{t+1,i} = w_{t,i} \cdot \beta_t^{\,1-|q_t(x_i)-y_i|}$, where $\beta_t = \frac{e_t}{1-e_t}$.

7) Steps 3-6 are repeated T times, T being the desired number of weak classifiers.

8) The final strong classifier:

$$C(x) = \begin{cases} 1, & \text{if } \sum_{t=1}^{T} \alpha_t \cdot h_t(x) \;\ge\; \frac{1}{2}\sum_{t=1}^{T} \alpha_t \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$

with $\alpha_t = \log\frac{1}{\beta_t}$.

Following the description above we obtain a strong classifier composed of T weak classifiers. Adding more weak classifiers will likely result in a higher detection rate (lower error), but it also directly affects the computation time, in the operational as well as the training phase. Compared to the rectangle features originating from [11], our features omit a parameter, yielding a tremendous speed-up during training. This will become clear further below.

An alternative to the single strong classifier is the so-called cascaded classifier scheme, which is computationally efficient and provides high detection rates. The idea is to create a number of less complex strong classifiers and to evaluate them consecutively. A single negative decision at any level of such a cascade leads to an immediate disregard of the concerned candidate image region. In this case, the classifiers at the remaining levels do not need to be evaluated, whereas positive decisions trigger further consideration by other (weak) classifiers. Note that the final detection rate D of the cascade can be expressed in terms of the detection rates per level d, such that $D = \prod_{i=1}^{K} d_i$. The same holds for the final false-positive rate F and the false-positive rate per level f, yielding $F = \prod_{i=1}^{K} f_i$. In both cases, K is the number of levels. In this way, each level of the cascade has to keep nearly all the positive training data, but on the other hand only needs to sort out a portion of the negative examples in order to deliver good results; e.g. a 10-level cascade with f = 0.33 will lead to a false-positive rate of 1.5 × 10^-5. This fact makes it possible to achieve very high detection rates per level while keeping down the false positives. The configuration of a series of strong classifiers is driven by the detection and false-positive rates per level. First, the detection rate of the current strong classifier at the top layer is checked. If d is too low, the threshold of the classifier (right-hand side of the inequality in equation (6)) is successively reduced until a predefined d is reached. If f is still obeyed, the level is complete. Otherwise further weak classifiers have to be added until both rates d and f are complied with.

When establishing the cascade, i.e. configuring and training a strong classifier per level, we apply a bootstrapping strategy. After completing each level, the rejected negative class examples are replaced by new ones, which the cascade would still (wrongly) classify as positive. It is worth mentioning that this remarkably prolongs the training, since it naturally becomes more unlikely to encounter such examples. Nevertheless, this results in a very discriminative cascade, in which the classifiers become more and more complex as they focus on trickier examples.
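The early-rejection logic and the per-level rate bookkeeping can be summarized as below; the per-level threshold is assumed to be stored alongside each strong classifier's score function, since the text describes lowering the right-hand side of equation (6) per level.

```python
def cascade_accepts(window_features, cascade):
    """Evaluate one candidate window against a cascade of (score_fn, threshold)
    levels, ordered from cheapest to most complex.  A single negative
    decision rejects the window immediately."""
    for score_fn, threshold in cascade:
        if score_fn(window_features) < threshold:
            return False            # early rejection: remaining levels skipped
    return True

def cascade_rates(per_level_d, per_level_f):
    """Overall detection rate D and false-positive rate F as products of the
    per-level rates (text example: 10 levels with f = 0.33 per level give
    F = 0.33**10, roughly 1.5e-5)."""
    D, F = 1.0, 1.0
    for d, f in zip(per_level_d, per_level_f):
        D *= d
        F *= f
    return D, F
```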

C. Implementation

In order to make the object detection computationally efficient, the number of operations required for each detection window has to be minimized. By means of the cascaded classifier approach described in the previous section it is possible to make early decisions on whether a detection window is useful or not. This is in favor of the execution time, since only a few features have to be evaluated at non-object sites. Furthermore, we can reduce the number of operations needed to calculate and classify a single feature. In this study, we employ so-called lookup tables to speed up this process. From a programmer's point of view, they are simple arrays containing precalculated values, which are accessed by their indices. Recalling the quangle features of type q1 and particularly q2 in equations (5a) and (5b), lookup tables provide an effective solution for both of them. This is especially beneficial, since we can avoid a separate calculation of the double angle representation in the case of q2 and thus save many operations. Figure 2 depicts two exemplary lookup tables, for both q1 and q2.

Fig. 2. Example lookup tables (rightmost) representing quangle masks in the case of single angle (top) and double angle (bottom) representation. The original gradient angle is floored in [0, 360) and used as an index into the binary lookup tables. Furthermore, for the evaluation of any mask used in q2, a “helper quangle” can be imagined, which directly corresponds to the double angle representation and exists only in the form of lookup tables. This is shown in the lower part of the figure.

Each quangle mask is represented by a binary lookup table, which uses the angle as an index. The value at a certain position corresponds to 1 or 0 depending on whether the index (angle) is within a certain partition or not. This partition is defined by the respective quangle mask. In order to be able to use the angle as an index, we floor the original gradient angle to integer values in [0, 360). However, the quangle features of type q2 need some further attention. As visualized at the bottom of figure 2, mod(2·∠∇, 2π) can be represented by the ordinary gradient angle ∠∇ by means of “helper quangles” (displayed in light gray), existing only in the form of lookup tables. First, we divide the thresholds τ1 and τ2 of the original quangle mask by two. The resulting partition, together with a 180° shifted version, is set to 1 in the respective lookup table. Since this corresponds to an inversion of the double angle calculation, we can use the original gradient angle ∠∇ as an index, yet obtain a classification for the doubled angle. As a consequence, one array access is needed per weak classification for any quangle.
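A possible construction of the binary lookup tables, with the halving-plus-180°-shift trick for q2, is sketched below. The angle flooring to integer degrees follows figure 2, while the function names and the wrap-around handling are ours.

```python
import numpy as np

def build_lookup_tables(tau1, tau2):
    """Binary lookup tables (indexed by the floored gradient angle in degrees)
    for one quangle mask: lut1 implements q1, lut2 implements q2.

    For lut2 the mask thresholds are halved and a 180-degree shifted copy is
    added, so that indexing with the original gradient angle yields the
    classification of the doubled angle."""
    phi = np.deg2rad(np.arange(360))

    def inside(angle, t1, t2):
        # Membership test of equation (2), with wrap-around when t1 > t2.
        return (t1 < angle) & (angle < t2) if t1 < t2 else (angle > t1) | (angle < t2)

    lut1 = inside(phi, tau1, tau2).astype(np.uint8)

    h1, h2 = tau1 / 2.0, tau2 / 2.0             # halved thresholds ("helper quangle")
    shifted = np.mod(phi - np.pi, 2.0 * np.pi)  # 180-degree shifted copy
    lut2 = (inside(phi, h1, h2) | inside(shifted, h1, h2)).astype(np.uint8)
    return lut1, lut2

# One weak classification is then a single array access, e.g.
#   angle_deg = int(np.floor(np.degrees(np.arctan2(fy, fx))) % 360)
#   decision  = lut2[angle_deg]
```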

Having a trained (cascaded) classifier, we can use the algorithm illustrated in figure 3 to detect objects/faces within any given image. The image to be analyzed serves as a starting point at scale s = 0 and scale factor sf = 1. In the next steps, we calculate the gradient (see equation (1)) of the whole image at scale s using separable Gaussians and their derivatives, and extract the angle information. After this, we scan the image with the detection window. The trained classifier describes a sequence of thresholds and weights for the most discriminative quangle features to be evaluated. The actual classification is done using the lookup tables introduced above. Having the candidates of the first scale, we successively reduce the image size by a factor of 1.25 and start over with the calculation of the gradient. This is done for 10 further scales or until the new size would become smaller than the detection window. The objects detected at each scale are integrated into a final list of candidates. In order to ensure that an actual object is only “counted” once (indicated by a single framing rectangle in the image), neighboring candidates in position and scale are grouped (averaged).
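The scanning procedure of figure 3 could look roughly as follows in Python. The window size, the Gaussian sigma, and the use of scipy for filtering and resizing are our assumptions, and classify_window stands for the trained cascaded classifier.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def detect_multiscale(image, classify_window, win_h=24, win_w=22,
                      n_scales=11, scale_factor=1.25, sigma=1.0):
    """Sliding-window detection over a downsampled image pyramid.

    classify_window(angle_patch) -> bool is the trained (cascaded) classifier
    operating on the gradient-angle patch of one detection window.
    Returns candidate boxes (row, col, height, width) in original coordinates.
    """
    candidates = []
    current, factor = image.astype(float), 1.0
    for _ in range(n_scales):
        if current.shape[0] < win_h or current.shape[1] < win_w:
            break                                    # smaller than the window
        # Gradient via separable Gaussian derivatives; only the angle is kept.
        fy = gaussian_filter(current, sigma, order=(1, 0))
        fx = gaussian_filter(current, sigma, order=(0, 1))
        angles = np.mod(np.arctan2(fy, fx), 2.0 * np.pi)
        for r in range(current.shape[0] - win_h + 1):
            for c in range(current.shape[1] - win_w + 1):
                if classify_window(angles[r:r + win_h, c:c + win_w]):
                    candidates.append((int(r * factor), int(c * factor),
                                       int(win_h * factor), int(win_w * factor)))
        current = zoom(current, 1.0 / scale_factor)  # next, coarser scale
        factor *= scale_factor
    return candidates  # neighboring candidates in position/scale are then grouped
```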

D. Face and Facial Landmark Detection

In this section we apply the object detection system introduced above to face and facial landmark detection. The size of the search window for face detection is 22 × 24 pixels. Our system operates in real-time at a resolution of 640 × 480 using 11 scales on a standard desktop computer. We collected approximately 2000 faces of varying quality from online newspapers for training purposes. In addition, to compensate for differences in size and rotation, all face images were aligned by means of three facial landmarks, namely the mouth and both eyes. Having that, slight rotations in the interval [−10°, ..., +10°] are applied to the training faces. Some background is included in a typical positive (face) example. On the other hand, the negative examples are chosen randomly from a large number of images which do not contain any faces. Also, when training a cascaded classifier, further negative examples are automatically “harvested” from these images.

In the following, we present a small experiment to provide some important initial results and insights. In order to strengthen the argument in section II-A, where we suggest the use of both single and double angle features, we train a strong classifier employing both feature types versus classifiers using either feature type alone. Empirical tests on a subset of 900 positive examples and 9000 negative examples revealed that {Q}_{8,5} is an eligible set of quangle masks for face detection. The least number of quangle features needed to separate these 9900 examples error-free served as a criterion, besides economic parameters for the set {Q}. By doing so, we also advanced to reduce the complexity of strong classifiers and, consequently, cascaded classifiers. A total of 36 quangle features (28 of type q1 and 8 of type q2) were selected when using {Q}_{8,5}. Figure 4 visualizes the selected features in separate detection windows. In both cases the black arrows indicate ∠∇ of the underlying average face at the feature sites. The white partitions show the range the respective angle is supposed to lie in. The hourglass-shaped partitions in the second image indicate double angle features, which would also tolerate the gradient angle pointing in the opposite direction (indicated by the gray arrows). The radii are modulated by αt, the weights of the corresponding weak classifiers, as an indication of the features' relevance. It can be observed that single angle features frequently occur in the inner facial regions, whereas features of the second type are situated in the bounding regions. In a further step, we trained two strong classifiers using the same training setup, yet employing only features of type q1 or q2, respectively. Error-free separation of the training data required 49 single angle or 83 double angle features. Compared to the combined setup, these numbers are significantly higher.


Fig. 3. The object/face detection process. An input image is investigated at 11 scales. At each of them, the angle of the gradient is analyzed by a trained (cascaded) classifier within a sliding window. The object candidates of each scale are integrated into a final list and multiple detections are eliminated.

Fig. 4. Exemplary strong classifier employing both single and double angle features, which are displayed side by side. Most of the single angle features (left) can be found in the inner facial region, whereas double angle features (right) are likely to be positioned at the face boundary. The black arrows point in the gradient direction, whereas the gray arrows indicate a 180° shifted version in the case of the double angle features.

Assuming that we have the sizes and positions of all faces in an image, we continue with facial landmark detection. The focus is on the mouth, since it is needed for “liveness” inspection later on. However, the presented concept also works for other facial landmarks. For training purposes, we extracted the mouths from 600 out of the 2000 available faces. The size of the mouth regions is 11 × 7 pixels, and small rotations in the interval [−5°, ..., +5°] are included as well. All other face parts, but also non-face patterns, are used as negative examples. Furthermore, the training proceeds as implemented for face detection.

Having automatically detected faces, among-scale grouping has affected their observed sizes. Therefore, we use the two (nearest) neighboring scales for mouth detection. In each of them, we classify the mouth detection window at 5 positions, sampled from N(µ, σ). Here, µ is the expected mouth center within the face detection window and σ is set to 1.5. This corresponds to a biased random search in an approximately 5 × 5 neighborhood (a sketch of this sampling is given below). In a final step, the mouth candidates of both scales are grouped in order to eliminate multiple detections.

Since we already know the size and the position of each face and do not need to recalculate the gradients, face and facial landmark detection can be combined without noticeably increasing the computational load. We show an example of combined face and mouth detection by our method in figure 5.
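The biased random search for the mouth window can be sketched as below; the function name and the use of numpy's random generator are ours, while the five samples and σ = 1.5 follow the text.

```python
import numpy as np

def mouth_candidate_positions(expected_center, n_samples=5, sigma=1.5, seed=None):
    """Sample candidate centers for the 11x7 mouth window around the expected
    mouth position inside a detected face (run at the two neighboring scales);
    this is the biased random search in roughly a 5x5 neighborhood."""
    rng = np.random.default_rng(seed)
    cx, cy = expected_center
    offsets = rng.normal(0.0, sigma, size=(n_samples, 2))
    return [(int(round(cx + dx)), int(round(cy + dy))) for dx, dy in offsets]
```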

Fig. 5. A (cropped) example image from the CMU-MIT face test set, with faces and mouths detected by the proposed method.

III. “LIVENESS” ANALYSIS

We stated that our “liveness” verification barriers need reliable face (part) detection and optical flow estimation. Having explained the former, we first summarize the method employed for fast optical flow calculation. It is worth noting that, because we switch to the “liveness” verification module after having detected a face (and a landmark), the gradient images are readily available for it. This offers, besides the usage of common features for all tasks, a significant reduction of computations when calculating the OFL features discussed next.

A. Summary of the OFL

In [16], [23] the optical flow of lines (OFL) has been proposed to measure the velocity at which lines move in image sequences, i.e. the technique approximates the velocity component normal to the line direction, the normal flow. The OFL is a lightweight alternative to all-embracing but time-consuming variants of motion estimation. The goal of the OFL is to approximate the two components of the normal velocity vectors of a line in the image plane, vx and vy. In the following, we denote a Gaussian derivative filter with respect to an arbitrary dimension axis z as hz, where z can be either of x, y or t, the horizontal, vertical and time dimension axes, respectively. Given a space-time stack of 3-5 consecutive images {Im}, the following steps are performed to obtain vx (a code sketch follows below):

1) Filter each frame of {Im} with hx to extract vertical lines, and save the result in a helper stack {f}.

2) Permute {f} along the y-axis to get all 2D space-time slices, and save them (absolute values only) in {xt}.

3) For each of {xt}, calculate its gradient by filtering with hx and ht, and compile a complex representation having the derivatives with respect to x and t as components. Take the square of the complex slices and save them over the previous versions in {xt}.

4) Average each of {xt} by use of a Gaussian kernel (with larger σ than for hz), yielding a set of images with complex values representing the linear symmetry [27], or the eigenvector of the local structure tensor, {ls}.

5) For every {ls} slice, consider the values at the center t-position only and take tan(0.5·∠ls) at every x-position, yielding a single row per slice. Store each of the rows at the corresponding y-position in the final vx (the y-position at which the original {xt} slice was taken out of {f}).

Two constraints are to be considered: firstly, velocities in vx are only valid if the corresponding values at the center frame of {f} are (in magnitude) above a threshold; secondly, velocities are only valid if they are below a threshold. To calculate vy, the algorithm above just needs to be fed with the x/y-transposed space-time stack. For details and results on error quantification of the OFL we refer to [16], [23]. As for the gradient approximation in section II, separable Gaussians and their derivatives are used here for all filtering tasks, requiring only a few operations per point. More specifically, if the object detection from above precedes, we can plug the gradient components within a ROI (region of interest) into {f} and proceed directly to step 2 of the OFL algorithm above. Usually, step 1 would involve the biggest dimensions to deal with. In our analysis toward “liveness” detection, we focus on detected faces and mouths as ROIs. Also, we can jump to scales of interest, since the image gradient for a pyramid of the original image is already estimated.
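Steps 1-5, together with the two validity constraints, can be condensed into the following sketch for vx (vy is obtained by feeding the x/y-transposed stack). The filter widths and threshold values are assumptions, and scipy's Gaussian derivative filters stand in for hx and ht.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ofl_vx(stack, sigma_d=1.0, sigma_avg=2.5, mag_thresh=1e-2, v_max=5.0):
    """Optical Flow of Lines: horizontal normal-flow component for the center
    frame of a (T, H, W) space-time stack of 3-5 consecutive images."""
    T, H, W = stack.shape
    # Step 1: respond to vertical lines by x-derivative filtering per frame.
    f = np.stack([gaussian_filter(frame, sigma_d, order=(0, 1)) for frame in stack])
    vx = np.zeros((H, W))
    for y in range(H):
        # Step 2: x-t slice for this row, absolute values only.
        xt = np.abs(f[:, y, :])                       # axes: (t, x)
        # Step 3: gradient of the slice as a complex number, then squared.
        d_x = gaussian_filter(xt, sigma_d, order=(0, 1))
        d_t = gaussian_filter(xt, sigma_d, order=(1, 0))
        z = (d_x + 1j * d_t) ** 2
        # Step 4: Gaussian averaging yields the linear symmetry (tensor direction).
        ls = gaussian_filter(z.real, sigma_avg) + 1j * gaussian_filter(z.imag, sigma_avg)
        # Step 5: center-time row; the half-angle tangent gives the velocity.
        v = np.tan(0.5 * np.angle(ls[T // 2, :]))
        # Constraints: enough line evidence and a bounded velocity.
        valid = (np.abs(f[T // 2, y, :]) > mag_thresh) & (np.abs(v) < v_max)
        vx[y, :] = np.where(valid, v, 0.0)
    return vx
```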

B. Repelling Spoofing Attacks

One can utilize a simple physical fact when assessing whether a face imaged by the system camera is “live” or whether it comes from a photograph presented to the system in a spoofing attempt. In figure 6, two frames cut out of camera footage are displayed. The rectangles indicate the detected faces, whereupon the actual face regions are replaced by the velocity magnitudes |v| = |(vx, vy)^T|. We may use as few as 3 consecutive frames to calculate the OFL, at face sites only. On the left-hand side of figure 6, we can observe that the velocity values remain rather constant, due to the photograph being flat. In contrast, on the right-hand side we can observe a higher variation of the velocity. This is because 3D face landmarks have different distances and mutual relationships with respect to the camera, thus generating non-uniform motion vectors with a specific mutual dependence over the face region [16]. The latter study compared three parts (face center + 2 sides), using Gabor-based techniques. Such a “liveness” barrier can be efficiently implemented using the proposed object detection.

Fig. 6. On the left-hand side, a bent photograph is held in front of the camera; on the right-hand side, an actual person is present. The faces are located (rectangles) by the method from section II. The OFL (optical flow of lines) is efficiently calculated through 3 frames in time, and the corresponding (absolute) velocity values are displayed on top. We can observe a larger variation of values in the case of the “live” face due to its depth.

However, assuming that a perpetrator manages to place a portable video device replaying a video of another person's face (with expression changes, other movements, speaking, etc.) at an adequate position in front of the system camera, most (commercial) recognition systems as well as “liveness” detection modules will have a hard time not being spoofed. Here we try to oppose this spoofing attempt in a fully automatic manner, requiring some interaction from the person in question. This interaction happens through the utterance of a specific digit sequence, either known previously by the person or prompted randomly. The latter strategy is to be favored, since a sound utterance can easily be recorded. We suggest a “liveness” verification barrier functioning autonomously (without assistance from voice) at the visual level, by exploiting how a sequence of digit utterances changes the facial expression. Here, the digits are decoded using the lip motion, on the one hand to enable a comparison with what should have been said, and on the other hand to be available for comparison with the result of another autonomous (speech) expert's digit recognition. The problem to solve in our approach is to identify the digits 0..9, one by one in a sequence, by assigning the observed lip motion to one of the 10 classes. First, however, we have to confine the mouth region of a “talking face”. In [23], the mouth center was pinpointed manually, to be able to assess the potential of the used motion estimation (OFL). Here we can employ the face (part) detection from section II to automatically locate the mouth and proceed with a fully automatic lip-reading of digits. Thus, the OFL computations are carried out only within an automatically placed ROI centered at the mouth. A square region is used as a sliding window to span space-time stacks of five consecutive frames each. Next, a dimension reduction for the purpose of classification is done.


The calculated velocity vectors v = (vx, vy)^T are first reduced to 1D by projection onto an intuitive stick-mouth model. This is also illustrated on the left-hand side of figure 7, which indicates a mouth region divided (by the dashed lines) into 6 parts.

Fig. 7. Dimension reduction of the extracted velocities in the mouth region: on the left-hand side, the 2D vectors in each part (divided by the dashed lines) are reduced to scalars by projecting them onto the orientations indicated by the solid bars. On the right-hand side, these scalars are clustered within 4 parts (divided by the dashed lines), with the gray values representing velocity magnitudes.

The expected orientations of motion are −45°, 90° and 45° in the upper 3 regions, and in the reversed order for the lower 3 parts, also marked by the solid bars in the figure. The velocity vectors in each part are projected onto the corresponding direction, yielding signed scalars. These scalars are further clustered within 4 parts of the mouth region. This is visualized on the right-hand side of figure 7, where dashed lines indicate the considered 4 parts. In each of them, 20 cluster centers (together with their populations) are automatically established using fuzzy C-means [28]. The final lip-motion statistics are then represented by a 40-dimensional vector per part containing the cluster centers and populations. These 160-dimensional vectors are extracted along a “talking face” video for a person uttering a single digit. This way a final matrix g of size 160 × N, where N is the number of frames, is gathered. For classification, g is reshaped into a row vector, used for training a 10-class SVM and to be classified by the ready expert, respectively. LIBSVM [29] was employed for this purpose.
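The per-frame dimension reduction might be sketched as follows. The split into halves, thirds and quadrants and the projection angles follow figure 7, whereas scipy's k-means is used as a plain stand-in for the fuzzy C-means of [28], and the interpretation of "population" as the number of samples assigned to a center is ours.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def lip_motion_descriptor(vx, vy, n_centers=20):
    """One 160-dimensional lip-motion vector for a single frame (sketch).

    vx, vy: OFL velocity components over the mouth ROI (H, W).  The ROI is
    split into upper/lower halves of 3 parts each; velocities are projected
    onto -45/90/45 degrees on top (reversed below), and the resulting scalars
    are clustered within 4 quadrants into n_centers centers each, giving
    4 * 2 * n_centers = 160 values (center positions plus populations)."""
    H, W = vx.shape
    thirds = [slice(0, W // 3), slice(W // 3, 2 * W // 3), slice(2 * W // 3, W)]
    top_angles = np.deg2rad([-45.0, 90.0, 45.0])

    scalars = np.zeros((H, W))
    for part, ang in zip(thirds, top_angles):
        scalars[:H // 2, part] = vx[:H // 2, part] * np.cos(ang) + vy[:H // 2, part] * np.sin(ang)
    for part, ang in zip(thirds, top_angles[::-1]):        # reversed order below
        scalars[H // 2:, part] = vx[H // 2:, part] * np.cos(ang) + vy[H // 2:, part] * np.sin(ang)

    quadrants = [scalars[:H // 2, :W // 2], scalars[:H // 2, W // 2:],
                 scalars[H // 2:, :W // 2], scalars[H // 2:, W // 2:]]
    descriptor = []
    for q in quadrants:
        data = q.reshape(-1, 1).astype(float)
        centers, labels = kmeans2(data, n_centers, minit='points')
        descriptor.extend(centers.ravel().tolist())                           # cluster centers
        descriptor.extend(np.bincount(labels, minlength=n_centers).tolist())  # populations
    return np.asarray(descriptor)   # length 160; stacking over N frames gives g (160 x N)
```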

IV. EXPERIMENTS

A. Face Detection

For the experiments, the face detector was configured as follows: the size chosen for the detection window was 22 × 24 and the employed quangle masks were in {Q}_{8,5}. A cascaded classifier comprising 22 levels was trained on 2000 faces. This number suffices since all of the latter belong to different persons and include slight, artificially introduced rotations. Each of the cascade's layers had to reject 2000 out of 4000 non-faces. The total number of weak classifiers in the cascade was 700. Such a classifier complexity is very small compared to the couple of thousands suggested in [11], [13]. This is due to the quangles employing derivative features and in effect incorporating more discriminative information than rectangle features. In operation, the first two levels of the cascade, comprising only 3 and 5 quangle features, respectively, are already able to reject 75% of all non-faces. Furthermore, we used s = 0 (original resolution) as the starting scale and sf = 1.1 as the factor for downsizing.

The performance of this face detection system is benchmarked on two publicly available databases, namely the YALE [25] and the CMU-MIT [5] face test sets. Extreme illumination changes are the main challenge of the former test set, which consists of 165 frontal face images of 15 subjects. Especially the changing light sources, often resulting in semi-illuminated faces, present a major source of errors/false negatives. In addition, facial expressions are varied (sad, sleepy, surprised, ...). The background is monotonous, apart from the faces' shadows cast on the wall, and there is only little variation in the scale of the faces. Table I shows the detection rates and the number of false positives of our method together with those of the face detection algorithm proposed in [9], on the YALE face test set.

TABLE I
DETECTION RATES AND THE NUMBER OF FALSE POSITIVES ON THE YALE FACE TEST SET

Method              Detection Rate    False Positives
Nguyen [9]          86.6%             0
Proposed method     100%              0

The CMU-MIT frontal face test set is among the most commonly used data sets for performance assessment of face detection systems. It is composed of 130 images containing 507 frontal faces in total. The quality of the images varies substantially, and the large variation in the scale of the faces further increases the difficulty of the detection task. Even a number of simple, hand-drawn face examples occur. In addition to the detection rate, this set also permits giving representative numbers for the false positives, because it contains many high-resolution images. In table II, the results of our technique on the CMU-MIT frontal test set are related to those of two prominent face detectors [5], [11], by adjusting the false positive rate to a common level. Also, the detection rate achieved by our method at 1 false detection per million evaluated windows is given, constituting our best result.

TABLE II
DETECTION AND FALSE POSITIVE RATES ON THE CMU-MIT FRONTAL FACE TEST SET

Method              Detection Rate    False Positive Rate
Rowley [5]          89.2%             1.27 × 10^-6
Viola&Jones [11]    92.9%             1.27 × 10^-6
Proposed method     94.2%             1.25 × 10^-6
Proposed method     93%               1 × 10^-6

The results on the YALE test set confirm that our face detection method is resistant to substantial illumination changes without performing any (histogram-related) preprocessing. Note that the latter is actually done in all methods we compare our results to. In figure 8, two “YALE faces” are shown, with detections by the proposed method indicated. Note the severity of the illumination conditions. Regarding moreover the achievements on the CMU-MIT test set, the proposed technique shows highly competitive performance.


Fig. 8. Two images from the YALE face test set, illustrating what we consider “severe” illumination changes, which are nevertheless managed by the proposed method, as can be seen.

B. “Liveness” Verification

In order to explore the potential of the proposed “liveness” verification barrier via a digit-prompted system, we use 100 users from the XM2VTS audio-visual database [26]. For each of the users, a “talking face” video is available in which the person's face was recorded while speaking aloud the sequence “0 1 2 3 4 5 6 7 8 9”. The main goal here is to recognize the digits by lip motion only, to assess its value for “liveness” independent of other modalities. For test purposes, these videos were segmented semi-automatically, so that short image sequences of single digit utterances were available in the end, yielding 100 short videos for every digit. Furthermore, the dataset was divided into a 60% training and a 40% test set. We used the public SVM implementation of [29] for the experiment. The presented results were attained with an RBF kernel, adjusted with the help of cross-validation over the training set. During the experiment, the velocity feature vectors were extracted at the mouth region and given to the 10-class SVM for each of the single digit videos.

For automatically locating the mouth, we employ the cascaded classifier from the experiments above plus an ad hoc cascaded classifier for mouth detection only. Using the face-mouth combination shortens the search for the mouth region within a video frame, allowing also for reduced training efforts for the mouth detector. The latter uses quangle masks from the set {Q}_{8,5} within its features as well, and has a complexity of 310 weak classifiers (spread out over 14 layers). This relatively large number of utilized quangle features presumably stems from the characteristics of the mouth region, which is not as discriminative as, for example, the upper face part or the whole face. Furthermore, no emphasis was put on training on especially opened mouths (as during speaking). The size of the ROI for optical flow extraction was kept constant at 128 × 128 pixels, which was possible due to negligible face/mouth scale variations throughout the videos.

In table III, we show the confusion matrix obtained for person “005” for all digits “0” to “9”, containing the probability estimates of (false) assignment of the automatically extracted lip velocities by the SVM. Looking at it, for example, the probability of classifying an uttered digit as “9” when “0” would be correct is 0.2. Accordingly, the probability of falsely identifying an uttered digit is the sum of all non-diagonal elements of the concerned row, e.g. $p(1|1^C) = (1 - p_1) = 1 - 0.6 = 0.4$ for digit “1”.

TABLE III
CONFUSION MATRIX FOR THE RECOGNITION OF DIGITS “0” TO “9”, UTTERED BY PERSON “005” OF THE XM2VTS DATABASE, WHEN AUTOMATICALLY EXTRACTED LIP MOTION IS CLASSIFIED BY AN SVM EXPERT

      0    1    2    3    4    5    6    7    8    9
0    0.8  0    0    0    0    0    0    0    0    0.2
1    0    0.6  0    0    0    0    0    0    0    0.4
2    0.4  0    0.6  0    0    0    0    0    0    0
3    0.3  0    0    0.7  0    0    0    0    0    0
4    0.3  0    0    0    0.7  0    0    0    0    0
5    0.3  0    0    0    0    0.7  0    0    0    0
6    0.3  0    0    0    0    0    0.7  0    0    0
7    0.3  0    0    0    0    0    0    0.7  0    0
8    0.2  0    0    0    0    0    0    0    0.8  0
9    0    0    0    0    0    0    0    0    0.2  0.8

Aggregating over all digits, the average misclassification probability p for a particular person is expressed in equation (7):

$$p = \frac{1}{10}\cdot\sum_{i}^{10} p\!\left(i\,|\,i^C\right) \qquad (7)$$
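As a worked example of equation (7), the diagonal of Table III directly gives the per-digit recognition probabilities for person “005”; the following snippet merely reproduces the averaging.

```python
import numpy as np

# Diagonal of Table III (person "005"): P(digit i recognized correctly).
p_correct = np.array([0.8, 0.6, 0.6, 0.7, 0.7, 0.7, 0.7, 0.7, 0.8, 0.8])

# Equation (7): average misclassification probability over the 10 digits.
p = np.mean(1.0 - p_correct)
print(p)   # 0.29, i.e. roughly the 0.3 reported in the text
```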

Successful digit recognition for the exemplary person “005”, involving all utterances, is achieved with a probability of approximately 0.7, p being 0.3. Furthermore, one could exclude the most frequently confused digits, e.g. “0”, to support the classification. Successful recognition of single digit utterances for all 100 persons, based on an averaged 1 − p, is achieved with probability 0.73. This should be viewed as an indication of the potential of using bare lip motion to assess “liveness”. Assuming an error-free lip expert for digit recognition, any disagreement between the uttered and the expected sequence (known by the user or randomly prompted) can of course be considered a spoofing attempt. Nevertheless, the improbability of an actual attack using the correct sequence enables even a moderate machine expert to be highly useful. It is worth pointing out the difficulty of performing text-prompted speech recognition only from video, not least experimentally, because the amount of data to be collected is an order of magnitude larger than for speech.

Finally, we give the results for face/mouth detection on the 100 used videos from the XM2VTS database, which has controlled conditions (uniform background, high resolution). Our face detector was successful in all observed frames with no false positives. We also quantify the accuracy of the mouth detection/localization. From previous studies [23], manually pinpointed (center) coordinates of the mouth are available for the 100 original videos (only for the first frame, though). We compare these with the automatically derived ones for the same frames. In figure 9, the occurring differences in x and y values are displayed on the left- and right-hand side, with the corresponding means and standard deviations being −1.7 ± 3.5 and −1 ± 3.1, respectively. The systematic deviation in the mean, which is a matter of detector bias, was compensated for in the experiment above. Note that the standard deviations of 3.5 and 3.1 are to be seen in the context of a 720 × 576 frame, with (roughly) estimated average face and mouth sizes of 280 × 300 and 80 × 50 pixels to be located, respectively.


Fig. 9. Histograms showing the differences in x and y coordinates on the left- and right-hand side, respectively, when comparing manually with automatically pinpointed mouth centers for the initial frames of 100 “talking face” videos. Note that the average mouth size is roughly 80 × 50 pixels.

Taking also into account that an image-based method (in contrast to a landmark-based one) was used, this accuracy is satisfying.

V. DISCUSSION

In this section, we summarize the aspects that differentiate the proposed methods from related approaches. An important factor for the performance of an object detector is speed, not excluding training time, since fast training enables a system to be rapidly adapted to new tasks. Viola&Jones, for example, reported in [11] that the training time of their final classifier was in the order of weeks on a single machine. Employing our features, the training of a comparable cascade takes about an hour on an ordinary desktop computer. To be fair with respect to advancements in computational power, we have timed the training of the 900/9000 classifier from Section II-D for both our features and Viola&Jones', using OpenCV. The latter provides functionality that directly implements Viola&Jones' approach, presumably with high efficiency. A standard desktop computer with a 2.2 GHz Intel processor and 1 GB of RAM was used for this purpose. Our (error-free) 36-quangle classifier was ready after 17 minutes. In contrast, the same training lasted 529 minutes and resulted in a 133-feature classifier using Viola&Jones' original basis functions. Since we can use fewer features, which do not need the greedy search for the best thresholds on the training data, this translates accordingly to the training time and complexity of a detection cascade. We infer that rectangle features, which rely on gray-level information, are not as discriminative as our quantized gradient-angle features and need to be calibrated more precisely. Furthermore, the operational speed of the proposed scheme (our implementation) is compared with that of the object detector included in OpenCV (Viola&Jones-based). Compared to the improvement in training speed, the face detection speed does not differ remarkably, which is probably related to our implementation. To process a 1280×1024 image on the mentioned system (standard desktop), both methods take about 2 seconds. Other studies have suggested schemes for reducing classifier complexity [13], [14], which we have not investigated yet, because our combined (q1 and q2) features resulted in a classifier that is simple enough. In a related study [9], which uses a naive Bayes classifier, only the original gradient angle was quantized, into a fixed number of seven steps, without a further study of flexible and lower quantization levels. In [12], no quantization at all was done (except for integer conversion), and only the structure tensor direction (the doubled gradient angle) was used. Furthermore, the weak classifiers were constructed differently there, involving significantly more operations per evaluation. Also, most object detection methods depend on a preprocessing step (gray-value normalization, equalization) prior to feature extraction, which we can omit because we exclusively operate on the gradient angle.

Coming to the application part, other studies have suggested tracking mouth minutiae, e.g. corner points [9] or lip contours [22] (not for “liveness” purposes, though). In a realistic environment, these approaches may suffer heavily from noise. Another disadvantage is the non-constant computation time due to the iterative convergence of the contour fitting. Instead, we robustly locate both the face and the mouth with the introduced method frame by frame, and we reuse the gradient to calculate the OFL within the mouth region in real time. Firstly, this assistance translates into savings in computations, e.g. for gradient approximation and interpolation; had we used an object detection scheme employing different features, we would have had to compute them, increasing the processing load. Secondly, the chances of overcoming severe noise are good, since the object detector has already been trained on numerous realistic samples. Lastly, to our knowledge, no work has been presented on digit recognition based on visual observations only. Our motion estimation technique can be used to separate the pauses from speech [23], but more experiments need to be conducted with respect to automatic digit segmentation in sequences. In its current state, the system works round-driven, with a single digit per time slot.
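To illustrate why the quantized features keep both training and evaluation cheap, consider the following minimal Python sketch. It assumes that a quangle feature is reduced to an integer bin index (a quantized gradient or double-angle direction at a fixed window position), so that the weak classifier is a single table lookup; the bin count, the table contents and the contrasting threshold-based weak learner are illustrative assumptions, not the exact implementation used in this paper.

    import numpy as np

    N_BINS = 36  # assumed angular quantization

    def quantize_angle(angle_rad, n_bins=N_BINS):
        """Map an angle in [0, 2*pi) to an integer bin index (illustrative)."""
        return int(angle_rad / (2.0 * np.pi) * n_bins) % n_bins

    # vote_table[k] holds the weak classifier's vote for bin k; in practice it
    # would be filled during boosting from the training statistics of that bin.
    vote_table = np.where(np.arange(N_BINS) < N_BINS // 2, 1.0, -1.0)

    def weak_quangle(angle_rad):
        """Quangle-style weak decision: one array access, no threshold."""
        return vote_table[quantize_angle(angle_rad)]

    def weak_rectangle(feature_value, threshold, polarity):
        """Viola&Jones-style weak decision: needs a learned threshold, whose
        best value must be searched for per feature during training."""
        return 1.0 if polarity * feature_value < polarity * threshold else -1.0

    print(weak_quangle(0.3))            # table lookup
    print(weak_rectangle(5.2, 4.0, 1))  # comparison against a calibrated threshold

The point is only that the quantized form removes the per-feature threshold search from training and reduces evaluation to indexing, which is consistent with the timing differences reported above.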

VI. CONCLUSION

In this study, we presented a novel real-time method for face detection. It is, however, possible to use the technique as a general image-object detector; an experimental support of this view is its straightforward usability for detecting mouths. The introduced quantized angle (“quangle”) features were studied experimentally, and we presented evidence for their richness of information, measured by their discriminative properties, and for their resilience to the impacts of severe illumination changes. They do not need preprocessing, e.g. histogram equalization/normalization, adding to their computational advantage. This was achieved by considering both the gradient direction and orientation, yet ignoring the magnitude. A quantization scheme was presented to reduce the feature space prior to boosting, enabling fast evaluation (one array access). Scale invariance was implemented through an image pyramid. The training is very fast, which enables the use of our object detector for changing environments and application needs. Furthermore, we proposed novel strategies to avert advanced spoofing attempts like replayed videos (and presented photographs), which take advantage of calculations from the object detection stage. For this purpose, we showed a possibility to restore utterances of a person by analyzing the motion of the lips only. The practicability of the proposed methods and ideas was corroborated by satisfying experimental results for face detection (e.g. 93% Detection Rate at 1×10⁻⁶ False Positive Rate on the CMU-MIT frontal face test set) and mouth detection, as well as for “liveness” assessment.
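As a minimal illustration of the scale-invariance strategy summarized above, the sketch below scans an image pyramid with a fixed-size detection window; the detector itself is abstracted into a placeholder, and the scale factor, window size and function names are assumptions for illustration only.

    import numpy as np

    def detect_fixed_size(image):
        """Placeholder for a detector operating at one fixed window size."""
        return []  # would return (x, y, w, h) tuples at this scale

    def downscale(image, factor):
        """Naive nearest-neighbour downscaling, sufficient for illustration."""
        h, w = image.shape[:2]
        ys = (np.arange(int(h / factor)) * factor).astype(int)
        xs = (np.arange(int(w / factor)) * factor).astype(int)
        return image[np.ix_(ys, xs)]

    def detect_multiscale(image, factor=1.25, min_size=24):
        """Scan an image pyramid so a fixed-size detector finds larger objects."""
        detections, scale, current = [], 1.0, image
        while min(current.shape[:2]) >= min_size:
            for (x, y, w, h) in detect_fixed_size(current):
                # Map each hit back to original image coordinates.
                detections.append((x * scale, y * scale, w * scale, h * scale))
            scale *= factor
            current = downscale(image, scale)
        return detections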


REFERENCES

[1] M.-H. Yang, D. Kriegman, and N. Ahuja, “Detecting Faces in Images: A Survey,” IEEE-PAMI, vol. 24, no. 1, pp. 34–58, Jan. 2002.

[2] M. Hamouz, J. Kittler, J. Kamarainen, P. Paalanen, H. Kalviainen, and J. Matas, “Feature-Based Affine-Invariant Localization of Faces,” IEEE-PAMI, vol. 27, no. 9, pp. 1490–1495, Sep. 2005.

[3] M. Turk and A. Pentland, “Eigenfaces for recognition,” J. Cognitive Neuroscience (Winter), vol. 3, no. 1, pp. 71–86, 1991.

[4] K. K. Sung and T. Poggio, “Example Based Learning for View-Based Human Face Detection,” Cambridge, MA, USA, Tech. Rep., 1994.

[5] H. Rowley, S. Baluja, and T. Kanade, “Human Face Detection in Visual Scenes,” in Advances in Neural Information Processing Systems 8, 1996, pp. 875–881.

[6] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, vol. 20, pp. 273–297, 1995.

[7] E. Osuna, R. Freund, and F. Girosi, “Training Support Vector Machines: an Application to Face Detection,” in IEEE Conference on Computer Vision and Pattern Recognition, June 17–19, 1997.

[8] H. Schneiderman and T. Kanade, “Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition,” in Proceedings of CVPR, 1998, p. 45.

[9] D. Nguyen, D. Halupka, P. Aarabi, and A. Sheikholeslami, “Real-Time Face Detection and Lip Feature Extraction Using Field-Programmable Gate Arrays,” SMCB, vol. 36, no. 4, pp. 902–912, Aug. 2006.

[10] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.

[11] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proceedings of the CVPR, Kauai, USA, December 2001, pp. 511–518.

[12] B. Froeba and C. Kueblbeck, “Real-Time Face Detection Using Edge-Orientation Matching,” in AVBPA ’01: Proceedings of the Third International Conference on Audio- and Video-Based Biometric Person Authentication. London, UK: Springer-Verlag, 2001, pp. 78–83.

[13] S. Z. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum, “Statistical Learning of Multi-view Face Detection,” in ECCV ’02: Proceedings of the 7th European Conference on Computer Vision, Part IV. London, UK: Springer-Verlag, 2002, pp. 67–81.

[14] J. Sochman and J. Matas, “WaldBoost – Learning for Time Constrained Sequential Detection,” in CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2. Washington, DC, USA: IEEE Computer Society, 2005, pp. 150–156.

[15] J. Bigun, Vision with Direction. Springer, 2006.

[16] K. Kollreider, H. Fronthaler, and J. Bigun, “Evaluating Liveness by Face Images and the Structure Tensor,” in Fourth IEEE Workshop on Automatic Identification Advanced Technologies, AutoID 2005, Buffalo, New York, 17–18 October 2005, pp. 75–80.

[17] F. Smeraldi and J. Bigun, “Retinal vision applied to facial features detection and face authentication,” Pattern Recognition Letters, vol. 23, pp. 463–475, 2002.

[18] J. Bigun, H. Fronthaler, and K. Kollreider, “Assuring liveness in biometric identity authentication by real-time face tracking,” in CIHSPS 2004 - IEEE International Conference on Computational Intelligence for Homeland Security and Personal Safety, Venice, Italy (IEEE Catalog No. 04EX815, ISBN 0-7803-8381-8), 21–22 July 2004, pp. 104–112.

[19] J. Li, Y. Wang, T. Tan, and A. K. Jain, “Live face detection based on the analysis of Fourier spectra,” in Biometric Technology for Human Identification, SPIE Volume 5404, August 2004, pp. 296–303.

[20] T. Choudhury, B. Clarkson, T. Jebara, and A. Pentland, “Multimodal Person Recognition using Unconstrained Audio and Video,” in 2nd International Conference on Audio-Visual Biometric Person Authentication, Washington D.C., 22–23 March 1999.

[21] H. Bredin, A. Miguel, I. H. Witten, and G. Chollet, “Detecting Replay Attacks in Audiovisual Identity Verification,” in ICASSP 2006 - IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, May 14–19, 2006.

[22] S. Dupont and J. Luettin, “Audio-Visual Speech Modelling for Continuous Speech Recognition,” IEEE Transactions on Multimedia, vol. 2, no. 3, pp. 141–151, Sep. 2000.

[23] M. I. Faraj and J. Bigun, “Person verification by lip-motion,” in Computer Vision and Pattern Recognition Workshop on Biometrics, Piscataway, New York, June 2006, pp. 37–44.

[24] J. Luettin, N. Thacker, and S. Beet, “Speaker identification by lipreading,” in Proceedings of the 4th International Conference on Spoken Language Processing, vol. 1, Oct. 1996, pp. 62–65.

[25] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection,” IEEE-PAMI, vol. 19, no. 7, pp. 711–720, 1997, Special Issue on Face Recognition.

[26] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: The extended M2VTS database,” in Audio and Video based Person Authentication - AVBPA99, University of Maryland, 1999, pp. 72–77.

[27] J. Bigun, T. Bigun, and K. Nilsson, “Recognition by symmetry derivatives and the generalized structure tensor,” IEEE-PAMI, vol. 26, pp. 1590–1605, 2004.

[28] N. R. Pal and J. C. Bezdek, “On cluster validity for the fuzzy c-means model,” IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370–379, Aug. 1995.

[29] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Klaus Kollreider received an M.S. in Computer Systems Engineering from Halmstad University in 2004, after which he joined its signal analysis group. His scientific interests include signal analysis and computer vision, in particular face biometrics and anti-spoofing measures by object detection and tracking. He is involved in a European project focused on biometrics, BioSecure, where he has also contributed to a reference system for fingerprint matching.

Hartwig Fronthaler obtained a Master’s degree at Halmstad University at the department of Computer Systems Engineering. His specialization has been in the field of image analysis with a focus on biometrics. He joined the signal analysis group of Halmstad University in 2004. His major research interests are in the field of fingerprint processing, including automatic quality assessment and feature extraction. Furthermore, he has been involved in research on face biometrics.

Maycel Isaac Faraj is a PhD student within the signal analysis group at Halmstad University. He received an M.S. in Computer Systems Engineering from Halmstad University in 2003. After contributing to a software product in the field of medical image analysis, he joined the Intelligent Systems Laboratory at Halmstad University in 2004. His research interests include signal analysis for audio-visual biometrics, computer vision and graphics.

Josef Bigun obtained his MS and Ph.D. degrees from Linkoeping University in Sweden in 1983 and 1988, respectively. Between 1988 and 1998 he worked at the EPFL in Switzerland. He was elected Professor to the Signal Analysis Chair at Halmstad University and Chalmers University of Technology in 1998. He has co-chaired several international conferences. He has been contributing as a referee or as an editorial board member of journals including PRL and IEEE Image Processing. He served in the executive committees of several associations, including IAPR. His scientific interests include a broad field in computer vision, texture and motion analysis, biometrics and the understanding of biological recognition mechanisms. He has been elected Fellow of IAPR and IEEE.