
ROBUST FACE ALIGNMENT USING CONVOLUTIONAL NEURAL NETWORKS

Stefan Duffner and Christophe Garcia
Orange Labs, 4, Rue du Clos Courtel, 35512 Cesson-Sévigné, France

{stefan.duffner, christophe.garcia}@orange-ftgroup.com

Keywords: Face alignment, Face registration, Convolutional Neural Networks.

Abstract: Face recognition in real-world images mostly relies on three successive steps: face detection, alignment and identification. The second step of face alignment is crucial as the bounding boxes produced by robust face detection algorithms are still too imprecise for most face recognition techniques, i.e. they show slight variations in position, orientation and scale. We present a novel technique based on a specific neural architecture which, without localizing any facial feature points, precisely aligns face images extracted from bounding boxes coming from a face detector. The neural network processes face images cropped using misaligned bounding boxes and is trained to simultaneously produce several geometric parameters characterizing the global misalignment. After having been trained, the neural network is able to robustly and precisely correct translations of up to ±13% of the bounding box width, in-plane rotations of up to ±30° and variations in scale from 90% to 110%. Experimental results show that 94% of the face images of the BioID database and 80% of the images of a complex test set extracted from the internet are aligned with an error of less than 10% of the face bounding box width.

1 INTRODUCTION

In the last decades, much work has been conducted in the field of automatic face recognition in images. The problem is however far from being solved as the variability of the appearance of a face image is very large under real-world conditions because of non-constrained illumination conditions, variable poses and facial expressions. To cope with this large variability, face recognition systems require the input face images to be well-aligned in such a way that characteristic facial features (e.g. the eye centers) are approximately located at pre-defined positions in the image. As pointed out by Shan et al. (Shan et al., 2004) and Rentzeperis et al. (Rentzeperis et al., 2006), slight misalignments, i.e. x and y translation, rotation or scale changes, cause a considerable performance drop for most of the current face recognition methods.

Shan et al. (Shan et al., 2004) addressed this problem by adding virtual examples with small transformations to the training set of the face recognition system and, thus, making it more robust. Martinez (Martinez, 2002) additionally modeled the distribution in the feature space under varying translation by a mixture of Gaussians. However, these approaches can

Figure 1: The principle of face alignment (translation, rotation, scale). Left: bounding rectangle of face detector; right: rectangle aligned with face.

only cope with relatively small variations and, in practice, cannot effectively deal with imprecisely localized face images like those coming from a robust face detection system.

For this reason, most of the face recognition methods require an intermediate step, called face alignment or face registration, where the face is transformed in order to be well-aligned with the respective bounding rectangle, or vice versa. Figure 1 illustrates this.

Existing approaches can be divided into two main groups: approaches based on facial feature detection and global matching approaches, most of the published methods belonging to the first group. Berg et


al. (Berg et al., 2004), for example, used an SVM-based feature detector to detect eye and mouth corners and the nose and then applied an affine transformation to align the face images such that eye and mouth centers are at pre-defined positions. Wiskott et al. (Wiskott et al., 1997) used a technique called Elastic Bunch Graph Matching where they mapped a deformable grid onto a face image by using local sets of Gabor filtered features. Numerous other methods (Baker and Matthews, 2001; Edwards et al., 1998; Hu et al., 2003; Li et al., 2002) align faces by approaches derived from the Active Appearance Models introduced by Cootes et al. (Cootes et al., 2001).

Approaches belonging to the second group are less common. Moghaddam and Pentland (Moghaddam and Pentland, 1997), for example, used a maximum likelihood-based template matching method to eliminate translation and scale variations. In a second step, however, they detect four facial features to correct rotation as well. Jia et al. (Jia et al., 2006) employed a tensor-based model to super-resolve face images of low resolution and at the same time found the best alignment by minimizing the correlation between the low-resolution image and the super-resolved one. Rowley et al. (Rowley et al., 1998) proposed a face detection method including a Multi-Layer Perceptron (MLP) to estimate in-plane face rotation of arbitrary angle. They performed the alignment on each candidate face location and then decided if the respective image region represents a face.

The method proposed in this paper is similar to the one of Rowley et al. (Rowley et al., 1998) but it not only corrects in-plane rotation but also x/y translation and scale variations. It is further capable of treating non-frontal face images and employs an iterative estimation approach. The system makes use of a Convolutional Neural Network (CNN) (LeCun, 1989; LeCun et al., 1990) that, after being trained, receives a mis-aligned face image and directly and simultaneously responds with the respective parameters of the transformation that the input image has undergone.

The remainder of this article is organized as follows. In sections 2 and 3, we describe the neural network architecture and training process. Section 4 gives details about the overall alignment procedure and in section 5 we experimentally assess the precision and the robustness of the proposed approach. Finally, conclusions are drawn in section 6.

2 THE PROPOSED SYSTEM ARCHITECTURE

The proposed neural architecture is a specific type of neural network consisting of seven layers, where the first layer is the input layer, the four following layers are convolutional and sub-sampling layers and the last two layers are standard feed-forward neuron layers. The aim of the system is to learn a function that transforms an input pattern representing a mis-aligned face image into the four transformation parameters corresponding to the misalignment, i.e. x/y translation, rotation angle and scale factor. Figure 2 gives an overview of the architecture.

The retina l1 receives a cropped face image of 46×56 pixels, containing gray values normalized between −1 and +1. No further pre-processing like contrast enhancement, noise reduction or any other kind of filtering is performed.

The second layer l2 consists of four so-called feature maps. Each unit of a feature map receives its input from a set of neighboring units of the retina as shown in Fig. 3. This set of neighboring units is often referred to as a local receptive field, a concept which is inspired by Hubel and Wiesel's discovery of locally-sensitive, orientation-selective neurons in the cat visual system (Hubel and Wiesel, 1962). Such local connections have been used many times in neural models of visual learning (Fukushima, 1975; LeCun, 1989; LeCun et al., 1990; Mozer, 1991). They allow extracting elementary visual features such as oriented edges or corners which are then combined by subsequent layers in order to detect higher-order features. Clearly, the position of particular visual features can vary considerably in the input image because of distortions or shifts. Additionally, an elementary feature detector can be useful in several parts of the image. For this reason, each unit shares its weights with all other units of the same feature map so that each map has a fixed feature detector. Thus, each feature map y_{2i} of layer l2 is obtained by convolving the input map y_1 with a trainable kernel w_{2i}:

y_{2i}(x,y) = \sum_{(u,v) \in K} w_{2i}(u,v)\, y_1(x+u, y+v) + b_{2i} ,   (1)

where K = {(u,v) | 0 ≤ u < s_x ; 0 ≤ v < s_y} and b_{2i} ∈ R is a trainable bias which compensates for lighting variations in the input. The four feature maps of the second layer each perform a different 7×7 convolution (s_x = s_y = 7). Note that the size of the obtained convolutional maps in l2 is smaller than their input map in l1 in order to avoid border effects in the convolution.
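To make the shared-weight convolution of Eq. (1) concrete, the following sketch computes one feature map with NumPy. The kernel and bias values are random placeholders, not trained weights; only the sizes (46×56 retina, 7×7 kernel, 40×50 output) follow the paper.

```python
import numpy as np

def conv_feature_map(y1, w2i, b2i):
    """Eq. (1): valid convolution of the input map with one shared 7x7 kernel."""
    sx, sy = w2i.shape
    out_h = y1.shape[0] - sx + 1
    out_w = y1.shape[1] - sy + 1
    y2i = np.empty((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            # every output unit reuses the same kernel and bias (weight sharing)
            y2i[x, y] = np.sum(w2i * y1[x:x + sx, y:y + sy]) + b2i
    return y2i

retina = np.random.uniform(-1.0, 1.0, (46, 56))  # normalized gray values
kernel = np.random.randn(7, 7) * 0.1             # placeholder shared weights
fmap = conv_feature_map(retina, kernel, b2i=0.05)
print(fmap.shape)  # (40, 50), one of l2's four 40x50 maps
```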

Layer l3 sub-samples its input feature maps into maps of reduced size by locally summing up the out-


Figure 2: Architecture of the neural network. Input l1: 1×46×56; 7×7 convolution → l2: 4×40×50; sub-sampling → l3: 4×20×25; 5×5 convolution → l4: 3×16×21; sub-sampling → l5: 3×8×10; l6: 40 neurons; l7: 4 outputs (translation X, translation Y, rotation angle, scale factor).

Figure 3: Example of a 5×5 convolution map followed by a 2×2 sub-sampling map.

put of neighboring units (see Fig. 3). Further, this sum is multiplied by a trainable weight w_{3j}, and a trainable bias b_{3j} is added before applying a sigmoid activation function Φ(x) = arctan(x):

y_{3j}(x,y) = \Phi\Big( w_{3j} \sum_{u,v \in \{0,1\}} y_{2j}(2x+u, 2y+v) + b_{3j} \Big) .   (2)

Thus, sub-sampling layers perform some kind of averaging operation with trainable parameters. Their goal is to make the system less sensitive to small shifts, distortions and variations in scale and rotation of the input at the cost of some precision.
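A minimal sketch of the trainable sub-sampling of Eq. (2), again with NumPy; the weight and bias values below are arbitrary placeholders, not trained parameters.

```python
import numpy as np

def subsample_map(y2, w3, b3):
    """Eq. (2): sum each 2x2 block, scale by w3, add b3, apply arctan."""
    out = np.empty((y2.shape[0] // 2, y2.shape[1] // 2))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            block_sum = y2[2 * x:2 * x + 2, 2 * y:2 * y + 2].sum()
            out[x, y] = np.arctan(w3 * block_sum + b3)
    return out

fmap = np.random.randn(40, 50)               # a 40x50 map from layer l2
print(subsample_map(fmap, 0.5, 0.1).shape)   # (20, 25), matching l3
```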

Layer l4 is another convolutional layer and consists of three feature maps, each connected to two maps of the preceding layer l3. In this layer, 5×5 convolution kernels are used and each feature map has two different convolution kernels, one for each input map. The results of the two convolutions as well as the bias are simply added up. The goal of layer l4 is to extract higher-level features by combining lower-level information from the preceding layer.

Layer l5 is again a sub-sampling layer that works in the same way as l3 and again reduces the dimension of the respective feature maps by a factor of two.

Whereas the previous layers act principally as feature extraction layers, layers l6 and l7 combine the extracted local features from layer l4 into a global model. They are neuron layers that are fully connected to their respective preceding layers and use a sigmoid activation function. l7 is the output layer containing exactly four neurons, representing x and y translation, rotation angle and scale factor, normalized between −1 and +1. After activation of the network, these neurons contain the estimated normalized transformation parameters y_{7i} of the mis-aligned face image presented at l1. Each final transformation parameter p_i is then calculated by linearly rescaling the corresponding value y_{7i} from [−1,+1] to the interval of

VISAPP 2008 - International Conference on Computer Vision Theory and Applications


Figure 4: Examples of training images created by manually misaligning the face (top left: well-aligned face image).

the respective minimal and maximal allowed values p_{min,i} and p_{max,i}:

p_i = \frac{p_{max,i} - p_{min,i}}{2} (y_{7i} + 1) + p_{min,i} ,  i = 1..4 .   (3)

3 TRAINING PROCESS

We constructed a training set of about 30,000 face images extracted from several public face databases with annotated eye, nose and mouth positions. Using the annotated facial features, we were able to crop well-aligned face images where the eyes and the mouth are roughly at pre-defined positions in the image while keeping a constant aspect ratio. By applying transformations on the well-aligned face images, we produced a set of artificially mis-aligned face images that we cropped from the original image and resized to have the dimensions of the retina (i.e. 46×56). The transformations were applied by varying the translation between −6 and +6 pixels, the rotation angle between −30 and +30 degrees and the scale factor between 0.9 and 1.1. Figure 4 shows some training examples for one given face image. The respective transformation parameters p_i were stored for each training example and used to form the corresponding desired outputs of the neural network by normalizing them between −1 and +1:

d_i = \frac{2 (p_i - p_{min,i})}{p_{max,i} - p_{min,i}} - 1 .   (4)

Training was performed using the well-known Backpropagation algorithm which has been adapted in order to account for weight sharing in the convolutional layers (l2 and l4). The objective function is simply the Mean Squared Error (MSE) between the computed outputs and the desired outputs of the four neurons in l7. At each iteration, a set of 1,000 face images is selected at random. Then, each face image example of this set and its known transformation

Figure 5: Overall alignment procedure: face detection → alignment estimation by CNN (e.g. translation X = −2, translation Y = +4, rotation angle = −14, scale factor = 0.95) → adjust by 10% using the inverse parameters → after 30 iterations: face extraction and rescaling, save final parameters.

parameters are presented, one at a time, to the neural network and the weights are updated accordingly (stochastic training). Classically, in order to avoid overfitting, after each training iteration, a validation phase is performed using a separate validation set. A minimal error on the validation set is supposed to give the best generalization, and the corresponding weight configuration is stored.
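The validation-based model selection described above can be sketched in miniature. The CNN is replaced by a toy one-parameter model and a noisy gradient stands in for the stochastic updates over 1,000 random face images; only the keep-the-best-validation-weights logic mirrors the paper.

```python
import random

def train_with_validation(w0=0.0, lr=0.1, iters=80, seed=0):
    """Noisy gradient descent on the toy loss (w - 3)^2; after each iteration,
    keep the weights with minimal validation error (early-stopping selection)."""
    rng = random.Random(seed)
    w, best_w, best_val = w0, w0, float("inf")
    for _ in range(iters):
        grad = 2.0 * (w - 3.0) + rng.gauss(0.0, 0.5)  # noisy stochastic gradient
        w -= lr * grad
        val_error = (w - 3.0) ** 2  # toy validation MSE (optimum at w = 3)
        if val_error < best_val:    # store best-generalizing configuration
            best_val, best_w = val_error, w
    return best_w

print(abs(train_with_validation() - 3.0) < 0.2)  # best weights near the optimum
```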

4 ALIGNMENT PROCESS

We now explain how the neural network is used to align face images with bounding boxes obtained from a face detector. The overall procedure is illustrated in Figure 5. Face detection is performed using the Convolutional Face Finder by Garcia and Delakis (Garcia and Delakis, 2004) which produces upright bounding boxes. The detected faces are then extracted according to the bounding box and resized to 46×56 pixels. For each detected face, the alignment process is performed by presenting the mis-aligned cropped face image to the trained neural network which in turn gives an estimation of the underlying transformation.


A correction of the bounding box can then simply be achieved by applying the inverse transformation parameters (−p_i for translation and rotation, and 1/p_i for scale). However, in order to improve the correction, this step is performed several (e.g. 30) times in an iterative manner, where at each iteration, only a certain proportion (e.g. 10%) of the correction is applied to the bounding box. Then, the face image is re-cropped using the new bounding box and a new estimation of the parameters is calculated with this modified image. The transformation with respect to the initial bounding box is obtained by simply accumulating the respective parameter values at each iteration. Using this iterative approach, the system finally converges to a more precise solution than when using a full one-step correction. Moreover, oscillations can occur during the alignment cycles. Hence, the solution can further be improved by reverting to the iteration where the neural network estimates the minimal transformation, i.e. where the outputs y_{7i} are the closest to zero.
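This iterative correction can be sketched under the simplifying assumption that the box state and the CNN's estimate live in the same (tx, ty, angle, scale) parameter space. `estimate` stands in for the trained network; here it is a toy oracle that reads off the residual misalignment exactly. The 30 iterations, the 10% step and the keep-the-minimal-estimate rule follow the text; everything else is illustrative.

```python
def iterative_align(estimate, box, n_iter=30, step=0.10):
    """Apply a fraction of the inverse estimated transformation per iteration
    and keep the state whose estimated residual is closest to zero."""
    best_box, best_norm = box, float("inf")
    for _ in range(n_iter):
        tx, ty, angle, scale = estimate(box)
        norm = abs(tx) + abs(ty) + abs(angle) + abs(scale - 1.0)
        if norm < best_norm:                  # outputs closest to zero so far
            best_norm, best_box = norm, box
        box = (box[0] - step * tx,            # 10% of the inverse translation
               box[1] - step * ty,
               box[2] - step * angle,         # 10% of the inverse rotation
               box[3] / (1.0 + step * (scale - 1.0)))  # partial inverse scale
    return best_box

# toy oracle: a perfect CNN whose estimate equals the true residual misalignment
aligned = iterative_align(lambda b: b, (4.0, -3.0, 10.0, 1.05))
print(aligned)  # residual misalignment has shrunk towards (0, 0, 0, 1)
```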

The alignment can be further enhanced by successively executing the procedure described above two times with two different neural networks: the first one trained as presented above for coarse alignment and the second one with a modified training set for fine alignment. For training the fine alignment neural network, we built a set of face images with less variation, i.e. [−2,+2] for x and y translations, [−10,+10] for rotation angles and [0.95,1.05] for scale variation. As with the neural network for coarse alignment, the values of the output neurons and the desired output values are normalized between −1 and +1 using reduced extrema p'_{min,i} and p'_{max,i} (cf. equation 4).

5 EXPERIMENTAL RESULTS

To evaluate the proposed approach, we used two different annotated test sets: the public face database BioID (available at http://www.humanscan.com/support/downloads/facedb.php) containing 1,520 images and a private set of about 200 images downloaded from the Internet. The latter, referred to as the Internet test set, contains face images of varying size, with large variations in illumination, pose, facial expressions and containing noise and partial occlusions. As described in the previous section, for each face localized by the face detector (Garcia and Delakis, 2004), we perform the alignment on the respective bounding box and calculate the precision error e, which is defined as the mean of the distances between its corners a_i ∈ R^2 and the respective corners of the desired bounding box b_i ∈ R^2, normalized with respect to the width W of the desired bounding box:

e = \frac{1}{4W} \sum_{i=1}^{4} \| a_i - b_i \| .   (5)
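The precision error of Eq. (5) is straightforward to compute; the corner coordinates below are made-up illustration values, not data from the paper.

```python
import numpy as np

def precision_error(corners, desired_corners, width):
    """Eq. (5): mean corner distance normalized by the desired box width W."""
    a = np.asarray(corners, dtype=float)
    b = np.asarray(desired_corners, dtype=float)
    return float(np.linalg.norm(a - b, axis=1).sum() / (4.0 * width))

desired = [(0, 0), (100, 0), (100, 120), (0, 120)]   # desired box, W = 100
shifted = [(5, 0), (105, 0), (105, 120), (5, 120)]   # pure 5-pixel x-shift
print(precision_error(shifted, desired, width=100))  # 0.05, i.e. a 5% error
```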

Figure 6 shows, for the two test sets, the proportion of correctly aligned faces varying the allowed error e.

For example, if we allow an error of 10% of the bounding box width, 80% and 94% of the faces of the Internet and BioID test sets respectively are well-aligned. Further, for about 70% of the aligned BioID faces, e is below 5%.

We also compared our approach to a different technique (Duffner and Garcia, 2005) that localizes the eye centers, the nose tip and the mouth center. The face images with localized facial features are aligned using the same formula as used for creating the training set of the neural network presented in this paper. Figure 7 shows the respective results for the Internet test set.

To show the robustness of the proposed method, we added Gaussian noise with varying standard deviation σ to the input images before performing face alignment. Figure 8 shows the error e versus σ, averaged over the whole set for both of the test sets. Note that e remains below 14% for the Internet test set and below 10% for the BioID test set while adding a large amount of noise (i.e. σ up to 150 for pixel values being in [0,255]).

Another experiment demonstrates the robustness of our approach against partial occlusions while adding black filled rectangles of varying area to the input images. Figure 9 shows, for two types of occlusions (explained below), the error e averaged over each test set with varying s, representing the occluded proportion with respect to the whole face rectangle.

For a given detected face, let w be the width and h be the height of the bounding box. The first type of occlusion is a black strip ("scarf") of width w and varying height at the bottom of the detected face rectangle. The second type is a black box with aspect ratio w/h at a random position inside the detected rectangle. We notice that while varying the occluded area up to 40% the alignment error does not substantially increase, especially for the scarf type. It is, however, more sensitive to random occlusions. Nevertheless, for s < 30% the error stays below 15% for the Internet test set and below 12% for the BioID test set.
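The two occlusion types can be sketched as follows; the image is a toy gray-value array, and the bookkeeping (a strip of height s·h, a box of aspect ratio w/h and area s·w·h) follows the description above.

```python
import numpy as np

def occlude_scarf(img, s):
    """Black strip of full width and height s*h at the bottom of the face box."""
    out = img.copy()
    strip_h = int(round(s * img.shape[0]))
    if strip_h > 0:
        out[-strip_h:, :] = 0.0
    return out

def occlude_random_box(img, s, rng):
    """Black box of area s*w*h with aspect ratio w/h at a random position."""
    out = img.copy()
    h, w = img.shape
    bh, bw = int(round(np.sqrt(s) * h)), int(round(np.sqrt(s) * w))
    y0 = int(rng.integers(0, h - bh + 1))
    x0 = int(rng.integers(0, w - bw + 1))
    out[y0:y0 + bh, x0:x0 + bw] = 0.0
    return out

face = np.ones((56, 46))                 # toy face crop, retina-sized
print(occlude_scarf(face, 0.25).mean())  # 0.75: a quarter of the area blacked out
```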

Figure 10 shows some results on the Internet test set. For each example, the black box represents the desired box, while the white box on the respective left image represents the face detector output and the one


Figure 6: Correct alignment rate vs. allowed mean corner distance e (Internet and BioID test sets).

Figure 7: Precision of our approach and an approach based on facial feature detection (Duffner and Garcia, 2005), on the Internet test set.

Figure 8: Sensitivity analysis: Gaussian noise (mean error e vs. noise standard deviation σ, Internet and BioID test sets).

Figure 9: Sensitivity analysis: partial occlusion (mean error e vs. occluded proportion s, for scarf and random-rectangle occlusions on both test sets).

on the respective right image represents the aligned rectangle as estimated by the proposed system.

The method is also very efficient in terms of computation time. It runs at 67 fps on a Pentium IV 3.2 GHz and can easily be implemented on embedded platforms.

6 CONCLUSIONS

We have presented a novel technique that aligns faces using their respective bounding boxes coming from a face detector. The method is based on a convolutional neural network that is trained to simultaneously output the transformation parameters corresponding to a given mis-aligned face image. In an iterative and hierarchical approach this parameter estimation is gradually refined. The system is able to correct translations of up to ±13% of the face bounding box width, in-plane rotations of up to ±30 degrees and variations in scale from 90% to 110%. In our experiments, 94% of the face images of the BioID database and 80% of a test set with very complex images were aligned with an error of less than 10% of the bounding box width. Finally, we experimentally show that the precision of the proposed method is superior to a feature detection-based approach and very robust to noise and partial occlusions.


Figure 10: Some alignment results with the Internet test set.


REFERENCES

Baker, S. and Matthews, I. (2001). Equivalence and efficiency of image alignment algorithms. In Computer Vision and Pattern Recognition, volume 1, pages 1090–1097.

Berg, T., Berg, A., Edwards, J., Maire, M., White, R., Teh, Y.-W., Learned-Miller, E., and Forsyth, D. (2004). Names and faces in the news. In Computer Vision and Pattern Recognition, volume 2, pages 848–854.

Cootes, T., Edwards, G., and Taylor, C. (2001). Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685.

Duffner, S. and Garcia, C. (2005). A connexionist approach for robust and precise facial feature detection in complex scenes. In Fourth International Symposium on Image and Signal Processing and Analysis (ISPA), pages 316–321, Zagreb, Croatia.

Edwards, G., Taylor, C., and Cootes, T. (1998). Interpreting face images using active appearance models. In Automatic Face and Gesture Recognition, pages 300–305.

Fukushima, K. (1975). Cognitron: A self-organizing multilayered neural network. Biological Cybernetics, 20:121–136.

Garcia, C. and Delakis, M. (2004). Convolutional face finder: A neural architecture for fast and robust face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11):1408–1423.

Hu, C., Feris, R., and Turk, M. (2003). Active wavelet networks for face alignment. In British Machine Vision Conference, UK.

Hubel, D. and Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160:106–154.

Jia, K., Gong, S., and Leung, A. (2006). Coupling face registration and super-resolution. In British Machine Vision Conference, pages 449–458, Edinburgh, UK.

LeCun, Y. (1989). Generalization and network design strategies. In Pfeifer, R., Schreter, Z., Fogelman, F., and Steels, L., editors, Connectionism in Perspective, Zurich.

LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. (1990). Handwritten digit recognition with a back-propagation network. In Touretzky, D., editor, Advances in Neural Information Processing Systems 2, pages 396–404. Morgan Kaufman, Denver, CO.

Li, S., ShuiCheng, Y., Zhang, H., and Cheng, Q. (2002). Multi-view face alignment using direct appearance models. In Automatic Face and Gesture Recognition, pages 309–314.

Martinez, A. (2002). Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(6):748–763.

Moghaddam, B. and Pentland, A. (1997). Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):696–710.

Mozer, M. C. (1991). The perception of multiple objects: a connectionist approach. MIT Press, Cambridge, USA.

Rentzeperis, E., Stergiou, A., Pnevmatikakis, A., and Polymenakos, L. (2006). Impact of face registration errors on recognition. In Artificial Intelligence Applications and Innovations, Peania, Greece.

Rowley, H. A., Baluja, S., and Kanade, T. (1998). Rotation invariant neural network-based face detection. In Computer Vision and Pattern Recognition, pages 38–44.

Shan, S., Chang, Y., Gao, W., Cao, B., and Yang, P. (2004). Curse of mis-alignment in face recognition: problem and a novel mis-alignment learning solution. In Automatic Face and Gesture Recognition, pages 314–320.

Wiskott, L., Fellous, J., Krueger, N., and von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):775–779.
