
Detection of Humans Using Color Information

Vít Novák

May 23, 2001


Declaration

I declare that I worked on this diploma thesis independently, with the assistance of my supervisor, and that I used no literature other than that listed in the references. I further declare that I have no objection to the use of the results of this thesis by the Faculty of Electrical Engineering of the Czech Technical University (ČVUT).

Prague, 23 May    Vít Novák


Abstract

This report describes a method for detecting human faces in color images, based on human skin color detection followed by a segmentation and verification scheme. Using the large Compaq database, skin and non-skin color models are built as simple histogram approximations of the probability distributions, and the Neyman-Pearson classifier used in [8] for pixel-level skin detection is designed. A connected component analysis for reducing detection errors is proposed, resulting in a slight improvement of the ROC curve in comparison with [8]. Using this method and the output mask from the skin classifier, a segmentation algorithm based on a region-growing technique using the color distribution is applied. The set of face candidates obtained from the region-growing algorithm is analyzed in a very simple rejection scheme. We show respectable results from the face detection algorithm, although the face shape model is very primitive.

Anotace

This thesis describes a method for detecting faces in color images. The main components of the method are skin detection, segmentation of homogeneous regions, and subsequent verification of the individual regions. Two models are used for skin detection: a histogram of skin color and a histogram of all other colors (non-skin). These models serve as approximations of the probability distributions of skin and non-skin color, respectively, and are used to design the Neyman-Pearson skin-color classifier first implemented in [8]. The classifier output is used for the subsequent segmentation, which employs a region-growing technique based on color and spatial distribution. The resulting set of candidates is then verified using a primitive spatial face model. Although this face model is very simple, the resulting algorithm shows promising results.

I would like to express my thanks to Mr. Jones, who provided us with the Compaq image database, and especially to my supervisor Mr. Matas for his time and the great deal of help he gave me.


Contents

1 Introduction

2 Related published work

2.1 Segmenting Hands of Arbitrary Color [19]

2.2 Robust Face Tracking using Color [13]

2.3 Self-Organized Integration of Visual Cues for Tracking [17]

2.4 Statistical Color Models with Application to Skin Detection [8]

2.5 Three Approaches to Pixel-level Human Skin Detection [2]

2.6 Tracking Interacting People [9]

2.7 Segmentation and Tracking of Faces in Color Images [14]

2.8 Detecting Human Faces in Color Images [18]

3 Training Dataset

3.1 XM2VTS Database [10]

3.2 Compaq Database [7]

3.3 WWW Face Database

4 Statistical approaches to skin detection

4.1 Bayesian Theory

4.2 The Neyman-Pearson Strategy

4.3 Non-random intervention

4.4 Experiments

5 A Multi Color Model for Face

5.1 Face color model

5.2 Experiments

6 Connected Components

6.1 Single face color distribution

6.2 Region Growing

6.3 Experiments

6.3.1 Color variability of the region

6.3.2 Difference at the boundary

6.3.3 Compactness of the region

6.3.4 Homogeneous regions detection

6.3.5 Starting points

7 Face detection

7.1 Experiments

8 Discussion and conclusions

A Applications description

A.1 Face detector

A.1.1 Application control

A.1.2 Functions description

A.2 Selector

A.2.1 Application control

A.2.2 Functions description


Chapter 1

Introduction

The great increase in computational resources over the last decade has provided computers with the means to solve many computer vision tasks, such as object tracking and recognition. Among these tasks, tracking and recognition of humans play one of the most important roles. Human motion tracking is essential for perceptual user interfaces, indoor and outdoor surveillance, sign-language recognition, efficient video coding, etc. Recognition of human faces is useful in any user identification system or face database management.

Detection of humans can be used as an initializing step for human motion, lip, or gesture tracking, as a localization step for face recognition, as well as for image database queries. The task of detecting people in a static image can be defined as a process with an image at the input and a set of positions of humans at the output. A simplified version of this task would only return the number of people in the image, or just signal whether any people are present. We seek a method that solves the first task under a large range of image conditions.

Color is an often-used feature in human motion tracking, because it is a well-suited orientation- and scale-invariant feature for segmentation and localization. The color of human skin occupies only a small fraction of the whole color space, and thus its frequent appearance in an image can point to a human presence. To localize people in images, the face is used very often, since the shape of the head is approximately invariant under different positions, and facial features such as the eyes and mouth are well suited to distinguish the face from other parts of the human body.

In our work, we have studied the skin detection problem based on the color of individual pixels. The skin color model is built using a large database of images containing human skin. Using this model together with the color model of non-skin colors occurring in the images, we have built a skin pixel classifier based on the statistical theory of pattern recognition, which was first implemented by Jones and Rehg [8]. To detect faces in the scene, we attempted to build a multi-color model of the human face using the color of skin, lips, and hair and a similar statistical approach, but we found this model too erroneous for any robust application. Instead, we have proposed a connected component analysis, interconnected with a color-based segmentation method. Finally, we apply a very simple validation scheme for face detection based on the shape features of the human face.

The report is arranged in the following way. Chapter 2 introduces several projects solving human tracking and detection tasks, generally based on color. The training image datasets used are briefly described in Chapter 3, and the statistical approach to the skin detection problem is analyzed in Chapter 4. The proposed multi-color face model is outlined in Chapter 5 and its results are shown. The proposed segmentation technique is reported and examined in Chapter 6, and the final algorithm for face localization, together with the simple verification scheme, is presented in Chapter 7. Finally, the results are discussed in Chapter 8.


Chapter 2

Related published work

The problem of detecting people arises in many applications of computer vision, e.g. in the areas of human-computer interaction, video coding, lip tracking, and surveillance. For this purpose, color information is widely used. In this description of related published work, we focus on detection methods that use color as the primary feature. Our attention is directed mainly at the human skin detection part; however, a brief overview of the application area is given as well. The key method for our skin segmentation technique is based on [8, 2], which are described in Sections 2.4 and 2.5.

2.1 Segmenting Hands of Arbitrary Color [19]

In human–computer interaction, visual perception of hands can be very useful, since the number of possible expressions with hand gestures is great and distinguishing between individual gestures is relatively easy. Zhu and Waibel use color to build a model for a hand segmentation method based on a simple statistical approach using Bayes decision theory. There, the hand segmentation is used for a wearable computer application: when the user moves his open hand into the field of vision of the camera, menu items are shown at each finger, and the user can select one item by bending the relevant finger.

Color is used to model the hand and the background, but the method does not require a predefined skin color model. Instead, it generates a hand color model and a background color model for any given image; the generated models are then used to segment the hand. The key to building the model is prior knowledge about the hand position in the image. This is expressed by the probability distribution


P(hand | x) over the image pixels x, which is learned from the training data.

Models are generated using a restricted EM algorithm, meaning that one of the mixture components has a fixed mean and a limited prior probability. This component is used to model the hand color, while the remaining components model the background. The mean of the hand color (the first component) is obtained from the given image using the distribution P(hand | x), and its weight is fixed at a value obtained from the training data.

As soon as the hand color model p(c | hand), the background color model p(c | background), and the prior probability P(hand) are known, the Bayes rule can be applied to each pixel of the given image.

Since the color models are built for each image separately, prior knowledge of the distribution P(hand | x) and the probability P(hand) is necessary. The hand then needs to appear in a particular place in the image and should have approximately the same size as in the training data. Another limitation stems from the use of a single Gaussian for modeling the hand color: the method can fail if the hand color distribution is not consistent.

2.2 Robust Face Tracking using Color [13]

Schwerdt and Crowley describe a robust face tracking technique used for video coding. The technique has two components: 1) a face tracking system which keeps the face centered in the image at a particular size, and 2) an orthogonal basis coding technique, in which the normalized face image is projected onto a space of basis images. We will focus on the tracking technique.

Because the face is a highly curved surface, the observed intensity of a face exhibits strong variations. These variations are eliminated by transforming the (R, G, B) color space into the two-dimensional intensity-normalized chromaticity space of (r, g) vectors:

r = R / (R + G + B),   g = G / (R + G + B).

To detect the face in an image, the Bayes rule is applied to every pixel, using two histograms: the color distribution of skin P(r, g | skin) and the color distribution of the whole image P(r, g). The histogram of skin color is made from a region of an image known to contain skin; it is initialized by eye-blink detection and updated within the tracking procedure. The result of the Bayes rule applied to the image is the probability map

P(skin | r(x, y), g(x, y)) = P(r, g | skin) P(skin) / P(r, g)

for each pixel (x, y). Using this map, the skin pixels are grouped together by computing the position and spatial extent of the color region. To reduce the jitter from outlying pixels with skin color (e.g. hands), the probability map is weighted by a Gaussian function placed at the location where the face is expected. The initial estimate of the covariance of this Gaussian is the size of the expected face; once initialized, the covariance is estimated recursively from the previous image using the mean and covariance of the obtained skin region.

The use of a Gaussian weighting function for new input data can lead to a problem if the object moves above a certain speed: if the speed is too high, the product of the new skin position and the old Gaussian function would vanish. The authors use a Kalman filter to eliminate this problem.

2.3 Self-Organized Integration of Adaptive Visual Cues for Face Tracking [17]

Another tracking method, proposed by Triesch and von der Malsburg, uses multiple cues to localize a human and focuses on integrating the individual results obtained from the cues rather than on providing a more sophisticated model of skin color. However, the integration method can be viewed as a basic scheme for robust tracking, usable with more specific models.

To cope with extensive environmental changes such as illumination and background changes, occlusions by other people, and so on, the system uses several cues, which agree on a result, and each cue adapts towards the result agreed upon. Experiments have shown that the system is robust if the changes in the environment disrupt only a minority of cues at the same time, although all cues may be affected in the long run.

There are five visual cues that the system uses for detecting the head of the tracked person. Each cue produces a saliency map A_i(x, t), where i indexes the cue and 0 ≤ A_i(x, t) ≤ 1. Positions x with a value close to one indicate high confidence that the head is present. To extract the saliency map, the cue features F_i(x, t) are compared at each position in the current image with a cue prototype p_i(t) using a cue-specific similarity function S_i:

A_i(x, t) = S_i(F_i(x, t), p_i(t)).

For the integration, each cue disposes of a reliability coefficient r_i(t), with Σ_i r_i(t) = 1. The saliency maps are then combined to produce the total result R(x, t) by computing a weighted sum, with the reliabilities acting as weights:

R(x, t) = Σ_i r_i(t) A_i(x, t).

The estimated target position x̂(t) is the point yielding the highest result:

x̂(t) = argmax_x R(x, t).

After finding the result x̂(t), a quality q_i(t) is defined for each cue, which measures how successful the cue was in predicting the total result. The qualities are normalized so that Σ_i q_i(t) = 1, and they drive the reliabilities through the dynamics

τ dr_i(t)/dt = q_i(t) − r_i(t),

with the time constant τ acting as a flexibility parameter. When the position x̂(t) is found, the prototypes are updated in the same spirit as the reliabilities:

τ_i dp_i(t)/dt = p̂_i(t) − p_i(t),

where p̂_i(t) is the feature vector extracted at position x̂(t). Since the system seeks to be robust, it is more concerned with the integration of the cues than with the cues themselves, so very simple techniques are employed to extract the cues:

- Intensity change tries to detect motion based on the thresholded difference of subsequent images; the threshold is a fixed value, so this cue is not updated.

- The color cue is computed by comparing the color of each pixel to a region of skin colors in HSI (hue, saturation, intensity) color space. If the pixel falls within the interval of allowed values, the result is one, otherwise zero. The prototype color region is adapted towards the color average taken from the neighborhood of the estimated position, provided the standard deviation does not exceed a certain threshold.

- Motion continuity tries to forecast the person's current position from the last two estimated positions. This cue is not adaptive.

- The shape cue computes the correlation of a grey-level template of the target with the image. High correlations indicate a high likelihood of the target being at the particular position.

- The contrast at position x is defined as the standard deviation of the grey-level values in a small image region around x.
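The integration equations above (weighted sum, argmax, and reliability dynamics) can be sketched in a few lines. This is my own minimal version, with a discrete Euler step standing in for the continuous dynamics τ dr_i/dt = q_i − r_i; the default τ is an illustrative assumption.

```python
import numpy as np

def integrate_cues(saliency_maps, reliabilities):
    """Total result R(x) = sum_i r_i * A_i(x) and its argmax position."""
    total = sum(r * a for r, a in zip(reliabilities, saliency_maps))
    x_hat = np.unravel_index(np.argmax(total), total.shape)
    return total, x_hat

def update_reliabilities(reliabilities, qualities, tau=10.0):
    """One discrete step of tau * dr_i/dt = q_i - r_i, renormalized so the
    reliabilities keep summing to one."""
    r = np.asarray(reliabilities, dtype=float)
    q = np.asarray(qualities, dtype=float)
    r = r + (q - r) / tau
    return r / r.sum()
```

A large τ gives the system long memory; a small τ lets it adapt quickly, with the trade-offs discussed below.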

The tracking technique was tested on 84 image sequences of people crossing a room. The testing set consists of six classes, from normal, where the person just crosses the room, via lighting changes and turning, to occlusions by another moving person. The system is initialized with the reliabilities of the color and intensity-change cues set to 0.5 and all other reliabilities set to zero; this choice reflects limited a priori knowledge about the target. A detection threshold θ was defined for deciding whether the person was in the scene by comparing it with the total result R(x̂, t). The most important parameters are the time constants τ and τ_i for the adaptation and the detection threshold θ. If the system adapts too fast, it has no memory and is easily disturbed by rapid temporal changes. If adaptation is too slow, the system has problems with harmless changes occurring in quick succession: it does not adapt fast enough and the result is a missed detection. If θ is too high, relatively small changes in the scene result in missed detections; vice versa, if θ is too low and the person leaves the room, the system tends to track the background.

Over the whole set of sequences, the rate of correct tracking was 58 %. The worst disturbance proved to be occlusion by another person, which is understandable, since the color, contrast, and intensity-change cues can give the same vote for both humans.

2.4 Statistical Color Models with Application to Skin Detection [8]

The existence of large image datasets, such as photos on the World Wide Web, makes it possible to build powerful generic models for low-level image attributes like color using simple histogram learning techniques. Jones and Rehg use color models constructed from nearly 1 billion labeled pixels. They compare the performance of mixture models and histograms and find the histogram model to be superior in accuracy and computational cost. Using aggregate features computed from the skin detector, an effective detector for naked people is built.

The histogram models were constructed using a subset of 13,640 photos sampled from the total set of 18,696 photographs. In the 4,675 photos containing skin, the skin pixels were segmented by hand. Pixels not labeled as skin were discarded to reduce the chance that segmentation errors would contaminate the models. These labeled pixels were placed into the skin histogram model. The remaining 8,965 photos, which did not contain any skin pixels, were placed into the non-skin color histogram model.

Figure 2.1: ROC curves from [8] for a family of skin detectors based on different histogram and mixture models (probability of correct detection versus probability of false detection). The best ROC curve (number 5) is the result of using a 32³-bin histogram model trained on the full training data.

Given these two histograms, the probability that a given pixel belongs to the skin and non-skin classes can be computed:

P(c | skin) = s[c] / T_s,   P(c | ¬skin) = n[c] / T_n,

where c denotes the (R, G, B) triple, s[c] is the count in the bin of the skin histogram containing c, n[c] the corresponding count in the non-skin histogram, and T_s, T_n are the total counts in the skin and non-skin histograms. As soon as the skin and non-skin distributions are available, a skin classifier can be obtained. A particular RGB value is labeled as skin if

P(c | skin) / P(c | ¬skin) ≥ θ,

where 0 ≤ θ ≤ 1 is a threshold. The prior probabilities P(skin) and P(¬skin) are unknown, and θ depends upon an application-specific cost which is expressed by the required false positive or false negative error. This problem was solved by Neyman and Pearson (see Section 4.2).

An important property of the pixel classifier is the receiver operating characteristic (ROC) curve, which shows the relationship between the false negative and false positive errors. The ROC can be used to compare classifier performance. Jones and Rehg show a comparison between classifiers based on histograms and mixture models (see Figure 2.1). Mixture models using the EM algorithm [3] have been popular in earlier skin color modeling work. Separate mixture models with 16 Gaussians were fitted to the full skin and non-skin pixel data. The histogram model was found to be superior to the mixture models. In addition, Jones and Rehg regard the 32³-bin histogram, trained on the full dataset, as the best one in classification performance.

2.5 A Comparative Assessment of Three Approaches to Pixel-level Human Skin Detection [2]

Brand and Mason refer to [8] and compare three approaches to pixel-level skin detection. The first two approaches use simple ratios and color space transforms, respectively, whereas the third is the approach implemented by Jones and Rehg [8]. The Compaq skin and non-skin database (see Section 3.2) is used to quantitatively assess the three approaches.

The first approach is based on the very simple observation that human skin tends to have a predominance of red, and thus the ratio R/G is likely to be a good detector. The second approach employs a 3-D transformation designed explicitly for skin detection, leading to classification of skin pixels using just one dimension. The third technique uses the discrete 3-D probability map of skin color obtained from the Compaq database, first implemented by Jones and Rehg [8].

For the assessment, the threshold between skin and non-skin colors was set so that a fixed high proportion of skin pixels was accepted. Each method was then applied to the testing dataset and the false positive error was determined; this test corresponds to calculating a single point of the ROC curve shown in Figure 2.1. The R/G ratio method was extended by evaluating the ratios R/B and G/B as well, which also seemed to be significantly influenced by skin color. The lowest false acceptance error was achieved using the probability maps, followed by the tailored transform with a single-dimension threshold and then the R/G ratio discrimination; the extended ratio approach provided only a small improvement over the single ratio.

2.6 Tracking Interacting People [9]

McKenna et al. describe a system for tracking multiple people in a relatively unconstrained environment. Tracking is performed at three levels of abstraction: regions, people, and groups. Color and gradient information is used to extract the moving objects. Color information is then used to disambiguate occlusions and to estimate depth ordering and position during occlusion. Since the system, like that of [17], favors robustness over specificity, it can be viewed as complementary to tracking methods that use more specific human models.

The basic assumption is that the camera is stationary and that changes in the environment are slow relative to the motion of the people in the scene. The camera color channels are also assumed to carry Gaussian noise, and three variance parameters σ²_{r,cam}, σ²_{g,cam}, σ²_{b,cam} are estimated for the given camera. To model micro-motions in the scene, such as moving leaves, a color model based on a Gaussian distribution is estimated for each pixel. The stored background model for a pixel is (μ_r, μ_g, μ_b, σ_r, σ_g, σ_b).

The model is adapted on-line using simple recursive updates in order to cope with slow environmental changes. Adaptation is only performed in regions which the higher-level grouping process labels as background. Given a new color vector (r, g, b) of a background pixel, the updates are performed using an update parameter α with 0 ≤ α ≤ 1:

μ_c = α c + (1 − α) μ_c,
σ²_c = α (c − μ_c)² + (1 − α) σ²_c,

for all color channels c ∈ {r, g, b}. Before a pixel is considered as background, it is compared to the model: if |c − μ_c| > 3 max(σ_c, σ_{c,cam}) for any color channel, the pixel is set to the foreground.
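For a single channel, the recursive update and the foreground test can be sketched as follows. This is a minimal illustration; in particular, applying the variance update with the freshly updated mean is my reading of the recursion, and the default α is an assumed adaptation rate, not a value from [9].

```python
import numpy as np

def update_background(mu, var, c, alpha=0.05):
    """Per-channel recursive background update:
    mu <- alpha*c + (1-alpha)*mu;  var <- alpha*(c-mu)^2 + (1-alpha)*var."""
    mu_new = alpha * c + (1 - alpha) * mu
    var_new = alpha * (c - mu_new) ** 2 + (1 - alpha) * var
    return mu_new, var_new

def is_foreground(pixel, mu, var, cam_var, k=3.0):
    """Foreground when |c - mu| > k * max(sigma, sigma_cam) in any channel."""
    sigma = np.sqrt(np.maximum(var, cam_var))
    return bool(np.any(np.abs(np.asarray(pixel, dtype=float) - mu) > k * sigma))
```

Taking the maximum of the model and camera deviations prevents the per-pixel variance from collapsing below the known sensor noise floor.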

The assumption that illumination changes slowly is violated when the change is due to a shadow cast by people moving in the scene. To deal with shadows, the RGB color space is replaced with the chromaticities c_r and c_g to reduce the impact of intensity changes. But often there may be no difference in chromaticity between foreground and background (e.g. a dark green coat in front of grass). To cope with such cases, gradients at every position in the image are estimated using the Sobel masks in each direction. Each pixel's gradient is modeled using gradient means (μ_{x,r}, μ_{y,r}), (μ_{x,g}, μ_{y,g}), (μ_{x,b}, μ_{y,b}) and gradient magnitude variances σ²_{g,r}, σ²_{g,g}, σ²_{g,b}; in addition, the average variances σ̄²_{g,r}, σ̄²_{g,g}, σ̄²_{g,b} are computed. If the gradient change magnitude exceeds a certain value, exactly if

√((g_x − μ_x)² + (g_y − μ_y)²) > 3 max(σ_{g,c}, σ̄_{g,c})

for any color channel, the pixel is set to the foreground.

As the foreground is extracted, the tracking of moving people is performed at three levels of abstraction. Regions are connected components that are tracked consistently over time. A person consists of one or more regions grouped together; each person has a support map (mask), a timestamp, and an appearance model based on color. People join into a group if they share a region. A person is initialized when one or more regions satisfy a set of rules such as close proximity, overlap in the x-axis projection, etc.

In order to track people consistently as they enter and leave groups, each person's appearance must be modeled. A color model is built and adapted for each person being tracked. The model is adapted only while the person is alone, since segmentation is not reliable within a group. Color distributions were modeled using both histograms and Gaussian mixtures. The color distribution of person i is updated adaptively by

P_new(c | i) = α P_observed(c | i) + (1 − α) P(c | i),

and for each person i in the given group and for each pixel x from the group's mask, the probability P(i | x) is computed. When a group of several people splits, the individual people's color models are used to determine who belongs to each new group. Histogram color models are matched using the histogram H_{G_j} computed from a newly created group G_j and the model histogram H_i for person i. People are allocated to the group G_j maximizing the normalized match value

M(G_j, i) = Σ_c min(H_{G_j}(c), H_i(c)) / Σ_c H_i(c).
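The match value M(G_j, i) is a normalized histogram intersection and takes only a few lines to compute. A sketch, with function names of my own choosing:

```python
import numpy as np

def histogram_match(group_hist, person_hist):
    """M(G, i) = sum_c min(H_G(c), H_i(c)) / sum_c H_i(c)."""
    return float(np.minimum(group_hist, person_hist).sum() / person_hist.sum())

def allocate_person(group_hists, person_hist):
    """Index of the new group whose histogram best matches the person."""
    scores = [histogram_match(g, person_hist) for g in group_hists]
    return int(np.argmax(scores))
```

The intersection is bounded by 1 and reaches it exactly when the group histogram covers the person's model everywhere.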

The authors claim that the described background subtraction scheme is quite robust even in a relatively unconstrained environment, and that the use of adaptation even makes it possible to cope with brief camera motion without complete failure. The weakness lies in the model of a person, which is based only on color (mostly the color of the clothes); tracking would fail when grouping people dressed in a similar manner.

2.7 Segmentation and Tracking of Faces in Color Images [14]

Another segmentation method that uses color information for segmenting and tracking faces is proposed by Sobottka and Pitas. In addition, they use a simple model of the shape features of the face. First, skin-like regions are determined based on the color attributes hue and saturation. Then regions of elliptical shape are selected as face hypotheses, and these are verified by searching for facial features in their interior. After the face is detected, it is tracked over time using an active contour model. We focus mostly on the segmentation method and the face candidate verification.


The face segmentation is done simply by thresholding the color input image using predefined domains of hue and saturation that describe human skin color. The use of the HSV color space is motivated by its similarity to the human perception of colors. The luminance (value) component is discarded to obtain robustness towards changes in illumination and shadows. Given the segmented skin image, connected component analysis is performed and shape information is evaluated for each component. For the shape analysis, the best-fit ellipse E is computed based on moments, and then the distance between the ellipse and the component C is determined by

d = ( Σ_{x∈E} (1 − C(x)) + Σ_{x∉E} C(x) ) / Σ_{x∈E} 1,

where C(x) denotes the indicator function of the component at position x. The first sum counts the "holes" inside the ellipse, and the second the pixels of the component lying outside the ellipse. Connected components which are well approximated by the best-fit ellipse are then verified by searching for facial features.
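Given boolean masks for the component and its best-fit ellipse, the distance d is straightforward to compute. A sketch (the masks stand in for C and E; this is my own illustration, not the authors' code):

```python
import numpy as np

def ellipse_fit_distance(component, ellipse):
    """d = (holes inside E + component pixels outside E) / |E|,
    with `component` and `ellipse` boolean masks of equal shape."""
    holes = np.logical_and(ellipse, ~component).sum()     # E pixels not in C
    outside = np.logical_and(component, ~ellipse).sum()   # C pixels not in E
    return float((holes + outside) / ellipse.sum())
```

A perfect elliptical component yields d = 0; the threshold on d below which a component is kept as a face hypothesis is a tuning parameter.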

The facial feature extraction is based on the observation that the facial features differ in brightness from the other parts of the face: the eyes and mouth are usually darker than the surrounding skin. To analyze the facial feature positions, first the y-projection is determined by computing the mean grey-level value of every row of the connected component, and minima and maxima are searched for in the smoothed y-relief. For each significant minimum of the y-relief, an x-relief is computed by averaging the grey-level values of three neighboring rows; the x-relief is smoothed and its minima and maxima are found as well. For every minimum in the y-relief, the spatial positions of the minima and maxima in the x-relief are analyzed to find facial features. As a result, a set of facial feature candidates is obtained. The candidates are clustered according to their left and right x-coordinates to reduce their number.
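The y- and x-projections can be sketched as follows. This is my own minimal version (without the smoothing step); rows or columns with no component pixels yield NaN and would be skipped by the caller.

```python
import numpy as np

def y_relief(gray, mask):
    """Mean grey level of every row of the connected component.
    gray is a float image, mask the component's boolean support map."""
    return np.nanmean(np.where(mask, gray, np.nan), axis=1)

def x_relief(gray, mask, row, half_width=1):
    """x-relief around a significant y-relief minimum: column-wise average
    of grey levels over `row` and its neighbors (3 rows for half_width=1)."""
    lo, hi = max(row - half_width, 0), row + half_width + 1
    return np.nanmean(np.where(mask, gray, np.nan)[lo:hi], axis=0)
```

Minima of y_relief pick out dark horizontal bands (eye and mouth rows); minima of the corresponding x_relief then localize the features within those rows.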

Once the facial feature candidates are available, all possible constellations are built and each one is assessed based on the vertical symmetry of the constellation, the distances between facial features, and the assessment of each individual feature. Incomplete constellations are considered as well. The best constellation is chosen to represent the facial features, and the component's contour is considered to be the face contour.

After the segmentation step, a method for tracking the face contours known as active contours (snakes) is applied. An active contour is a deformable curve influenced by interior and exterior forces: the interior forces impose smoothness constraints on the contour, while the exterior forces attract the contour to significant image features.


The segmentation was tested on the M2VTS database. About 90 % of the facial features were correctly detected, including eyebrows, nostrils, and chin. However, the images in the database contain only one face under quite limited conditions (single-color background, limited illumination conditions, etc.), and thus the segmentation method cannot be examined very thoroughly.

2.8 Detecting Human Faces in Color Images [18]

Yang and Ahuja propose a method for face detection in unconstrained color images. The detection method is based on color and shape information. Multi-scale segmentation is utilized to cope with occlusions (e.g. hands and arms) and with the scale problem (detecting faces at different scales).

The skin color model is built to capture the chromatic characteristics of skin using the CIE LUV color space with the luminance value discarded, and it is approximated by a Gaussian distribution. The Gaussian is based on a histogram built from about 500 images containing human skin of different races and was accepted using a $\chi^2$ test. A pixel is identified as having skin color if the corresponding probability is greater than a threshold.

Multi-scale segmentation is used to extract a hierarchy of regions for matching. A segmented region is classified as a skin region if most of the pixels inside (above a fixed percentage) are classified as skin pixels. From the coarsest to the finest scale, the regions of skin color are merged until the shape is approximately elliptic. The orientation of the elliptic region is computed using moments of inertia, and the region becomes a face candidate if the ratio of the major axis to the minor axis is less than a threshold (1.7). If a candidate region has some darker regions or some holes inside the merged region, it is classified as a human face.
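The elliptic-shape test above can be illustrated via the moments of inertia. This is a sketch, not Yang and Ahuja's code: the region is assumed to be given as a list of pixel coordinates, and the returned major-to-minor axis ratio would then be compared against the 1.7 threshold.

```python
import math

def axis_ratio(pixels):
    """Major/minor axis ratio of the ellipse fitted to a pixel region
    via its second-order central moments (moments of inertia)."""
    pts = list(pixels)
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    mxx = sum((x - cx) ** 2 for x, _ in pts) / n
    myy = sum((y - cy) ** 2 for _, y in pts) / n
    mxy = sum((x - cx) * (y - cy) for x, y in pts) / n
    # eigenvalues of the 2x2 inertia (covariance) matrix give the
    # squared half-axis lengths up to a common scale factor
    tr, det = mxx + myy, mxx * myy - mxy ** 2
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    lam1, lam2 = tr / 2 + disc, tr / 2 - disc
    return math.sqrt(lam1 / lam2) if lam2 > 0 else float("inf")
```

A square region yields a ratio of 1.0, an elongated one a larger value; the orientation of the ellipse could be recovered from the same moments via `0.5 * atan2(2 * mxy, mxx - myy)`.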

A few examples of correct detection for different scales, rotations and background conditions are shown.


Chapter 3

Training Dataset

Since it is hard to define the skin color in the color space a priori, we use a training dataset to build the skin color model. The skin and non-skin histograms were built using the Compaq image database.

To examine the detection method's performance, a set of testing pictures that represents well the range of possible occurrences in the practical application is necessary. The extent of the required set can be defined by the illumination conditions and camera restrictions, background complexity restrictions and occlusions. The detection performance can vary a good deal across different datasets. Although results obtained using a set with strong restrictions (e.g. the XM2VTS database) can be very promising, the method can easily fail when analyzing a set of pictures from a less restricted database.

We have used three sets of images to test our detection methods' performance. Usually the XM2VTS database was used to examine first results and to test the method's correctness. The other sets were then used to test the performance on unconstrained pictures.

3.1 XM2VTS Database [10]

This database of images is derived from video sequences which contain a single frontal face view occupying a substantial part of the whole image. The background consists of a single color tone, which is quite distinct from the face color. The illumination source is in front of the subject, and thus the distribution of the intensity of the face color in the image is very narrow. The database contains 371 faces in 2448 images with different postures, illumination and clothing.


Figure 3.1: XM2VTS database examples


Figure 3.2: Our WWW database examples

3.2 Compaq Database [7]

A completely unrestricted set of images containing human faces comes from the WWW. In [8] a large image database was collected from the World Wide Web to build a comprehensive model of human skin color. From the whole set of 18,696 photographs, 13,640 were used to build the skin and non-skin color models. In the 4,675 images containing skin, the skin pixels were segmented by hand to produce a mask for every picture. The authors made the dataset available to the academic research community¹. This dataset was used to build the skin and non-skin color models and to test the skin detection methods' performance.

3.3 WWW Face Database

A small set of images was collected from Web sites to analyze the skin detection and the region growing technique and to present the results of these methods. The images for the region growing tests are small in size to speed up the testing, but the color distribution of the pictures represents a variety of lighting and camera conditions.

¹ Unfortunately, we are not allowed to publish the pictures from the database.


Chapter 4

Statistical Approaches to Skin Detection

Statistical approaches to pattern recognition have been widely used in many applications. For pixel-level skin detection, several approaches have appeared in previous work. First we briefly review the more frequent of them, and then the statistically well-founded approach is described.

The simplest one is based on the assumption that the skin color distribution has a small continuous extent in the used color space and that this extent is defined a priori (e.g. [11, 14, 17]). The skin blob in the color space is described by a set of thresholds in the individual channels (n-dimensional rectangles) or e.g. by an n-dimensional ellipse. Pixels whose color falls inside the defined region of the color space are then considered skin pixels. To ensure the continuity of the skin color, various color space transformations are performed.

Another approach uses a skin color model $p(c \mid S)$, which denotes the probability of appearance of the color $c$ within the set of possible skin colors $S$. The skin color model is then used for skin detection regardless of the background colors. Either the skin region is defined by thresholding the skin color model, $S = \{c \mid p(c \mid S) > \theta\}$ ([18]), or the probabilities are used directly for subsequent processing ([5]).

The third approach builds specific color models $p(c \mid S)$ and $p(c \mid N)$ for the skin and the background color respectively (e.g. [9, 13, 19]). These two models are built using a priori knowledge about the appearance of skin in the image. In addition, the prior probability of skin appearance is approximated using the same a priori knowledge. Then the Bayes rule is applied.

The first approach is a coarse approximation of the skin color model, and it depends on the used color space. In the second approach, the background color is not considered, which can lead to classification mistakes. To reduce the number of such mistakes, additional restrictions on the skin color are imposed ([5]). The third approach is used in more specific applications where a priori assumptions can be made. Since we have no a priori knowledge about the appearance of skin in an image, we need another approach, one which arises from statistical pattern recognition theory.

4.1 Bayesian Theory

Bayesian decision theory provides the minimal possible error rate, provided we have knowledge of the statistical models of the possible classes. First we introduce a few basic concepts we are going to use, and then we define the simplified Bayesian decision rule.

Let $\omega_i \in \Omega$ denote the possible states of nature of an object and let $x \in X$ denote the possible observations (features) of the object. Let us consider both $\Omega$ and $X$ as discrete random variables, correlated with each other through a set of joint probabilities $p(\omega_i, x)$, which denote the fraction of objects falling into the class $\omega_i$ and having the feature $x$. Since we know the joint probabilities, we also know the prior probabilities $p(\omega_i)$, which express the a priori knowledge about the occurrence of the class $\omega_i$, and the conditional probabilities $p(x \mid \omega_i)$, which express the fraction of observations $x$ within the class $\omega_i$, because

$p(\omega_i) = \sum_x p(\omega_i, x), \qquad p(x \mid \omega_i) = \frac{p(\omega_i, x)}{p(\omega_i)}.$

We denote our decision rule as $q$, taken from some set of decision rules $Q$. The decision rule is a function that maps the feature vector $x$ onto one of the classes $q(x) \in \Omega$. Now we can define the average probability of a wrong decision using the decision rule $q$:

$\varepsilon(q) = \sum_x \sum_{\omega_i \ne q(x)} p(\omega_i, x). \qquad (4.1)$

Now the aim is to find a strategy $R$ that minimizes $\varepsilon(R)$:

$R = \arg\min_{q \in Q} \sum_x \sum_{\omega_i \ne q(x)} p(\omega_i, x)$
$\phantom{R} = \arg\min_{q \in Q} \sum_x \Big[ \sum_i p(\omega_i, x) - p(q(x), x) \Big]$
$\phantom{R} = \arg\min_{q \in Q} \sum_x \big[ p(x) - p(q(x), x) \big]$
$\phantom{R} = \arg\min_{q \in Q} \Big[ 1 - \sum_x p(q(x), x) \Big]$
$\phantom{R} = \arg\max_{q \in Q} \sum_x p(q(x), x).$

From the last term we can see that the decision will be chosen so that $p(q(x), x)$ is maximal. Formally written, $q(x) = \hat\omega_i$ such that

$\hat\omega_i = \arg\max_{\omega_i \in \Omega} p(\omega_i \mid x) = \arg\max_{\omega_i \in \Omega} p(x \mid \omega_i)\, p(\omega_i) \qquad (4.2)$

will be the decision strategy. Since

$p(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, p(\omega_i)}{p(x)}, \qquad (4.3)$

the result corresponds to the intuitive solution, which is to maximize $p(\hat\omega_i \mid x)$, because $p(x)$ is constant across the individual $p(\omega_i \mid x)$.

For the skin detection problem $\Omega = \{S, N\}$, where $S$ and $N$ denote the skin and the non-skin class respectively, and the color vector $c$ represents the observation $x$. If we knew the prior probabilities $p(S)$ and $p(N)$, we would use 4.2 to classify a pixel as skin if

$p(c \mid S)\, p(S) > p(c \mid N)\, p(N)$

and as non-skin otherwise.

But in our case of skin detection, we only have the two conditional probabilities $p(c \mid S)$ and $p(c \mid N)$, because the appearance of skin within an image is not a random event. Some applications assume $p(S) = p(N) = 0.5$ and still use decision 4.2, but a better solution of this problem was found by Neyman and Pearson and will be briefly stated in the following section.
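For illustration, rule 4.2 for the two-class case reduces to comparing the two joint probabilities. This is a toy sketch; the probability values used in the usage example are invented, not measured on any database.

```python
def bayes_classify(p_c_given_skin, p_c_given_nonskin, p_skin):
    """Two-class Bayes decision (rule 4.2): pick the class that
    maximizes the joint probability p(c|class) * p(class)."""
    p_nonskin = 1.0 - p_skin
    if p_c_given_skin * p_skin > p_c_given_nonskin * p_nonskin:
        return "skin"
    return "non-skin"
```

With $p(c \mid S) = 0.004$ and $p(c \mid N) = 0.001$, the same color is classified as skin for $p(S) = 0.3$ but as non-skin for $p(S) = 0.1$, which is exactly why the unknown priors matter.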

4.2 The Neyman-Pearson Strategy

For the explanation of the Neyman-Pearson problem, we will use the skin detection notation stated in the previous section. Let $c \in C$ be the color of a pixel from a color space $C$. Let the probability distribution of $c$ depend on the state of the pixel $\omega_i \in \{S, N\}$. The distribution is known and is defined through the conditional probabilities $p(c \mid S)$ and $p(c \mid N)$.

The aim is to divide the color space $C$ into two subsets $C_S$ and $C_N$, so that a pixel is considered to be skin if $c \in C_S$ and non-skin if $c \in C_N$. Since some colors $c$ have both probabilities $p(c \mid S)$ and $p(c \mid N)$ non-zero, no perfect solution exists. If we use histograms to model the conditional probabilities, the overlap between the skin and non-skin classes will not change by applying any color space transformation. Thus the use of e.g. YUV or HSV space instead of RGB space will not improve the classifier's performance.

The quality of the decision strategy is measured by two numbers: the probability that a skin pixel will be classified as non-skin (missed skin, alias false negative error) and the probability that a non-skin pixel will be classified as skin (false skin, alias false positive error). The decision strategy can be found such that the probability of the false skin detection error

$\sum_{c \in C_S} p(c \mid N) \qquad (4.4)$

is minimized, while the missed skin error is at most $\varepsilon$,

$\sum_{c \in C_N} p(c \mid S) \le \varepsilon, \qquad (4.5)$

and $C_S \cup C_N = C$, $C_S \cap C_N = \emptyset$. The subsets $C_S$ and $C_N$ that solve this optimization problem are defined by the decision rule

$q(c) = \begin{cases} S & \text{if } \frac{p(c \mid N)}{p(c \mid S)} < \theta, \\ N & \text{otherwise.} \end{cases} \qquad (4.6)$

In other words, a threshold $\theta$ can be found such that the pixels with likelihood ratio lower than $\theta$ are considered to be skin color, while the colors whose likelihood ratio exceeds $\theta$ belong to the non-skin colors. The algorithm that finds the threshold $\theta$ for a given $\varepsilon$ is described in section 4.4.

4.3 Non-random intervention

In unrestricted images such as the WWW database, all possible races, illuminations, camera conditions etc. can appear, and such interventions can affect the actual color models. In practice, the color of a pixel is then dependent on the class and on the intervention. The dependence can be expressed by a probability distribution $p(c \mid \omega_i, v_1, v_2, \ldots, v_n)$, where $v_1, v_2, \ldots, v_n$ denote the interventions. We could model the color distribution within a class $\omega_i$ using

$p(c \mid \omega_i) = \sum_{v_1} \sum_{v_2} \cdots \sum_{v_n} p(c \mid \omega_i, v_1, v_2, \ldots, v_n)\, p(v_1, v_2, \ldots, v_n \mid \omega_i), \qquad (4.7)$

but for a given image, no reliable method for identifying such interventions is available to date. If the occurrence of an intervention $v_i$ could be modeled with a probability distribution and the prior probability were known, we could use

$p(c, \omega_i, v_i) = p(c \mid \omega_i, v_i)\, p(v_i \mid \omega_i)\, p(\omega_i). \qquad (4.8)$

But since the intervention $v_i$ is not random, the distribution $p(v_i \mid \omega_i)$ cannot be found. Practically, it means that e.g. a pixel's color is influenced by the illumination and the class together, but the illumination color is independent of the observed class and is not even a random variable.

To reduce the error arising from the non-random interventions, the color model should be carefully built using subsets of the training data, with every subset containing pictures taken under known conditions and with the subsets covering the whole range of conditions. Since this is a very laborious task, we only use the approximation of the skin and non-skin color models obtained from the Compaq database.

4.4 Experiments

We have used the Compaq database collected by Rehg and Jones (see section 3.2) to build the two color models $p(c \mid S)$ and $p(c \mid N)$ using histograms. As they recommend in [8], we use RGB color histograms with 32 bins in each color channel.

A histogram $H$ is a discrete function which maps an input color vector $c$ onto an integer $H(c) \ge 0$:

$H(c) = H_i \quad \text{for the bin } i \text{ whose interval contains } c \text{ in every dimension}. \qquad (4.9)$

In our case $H_i$ is the number of pixels with color $c$ falling into the bin $i$, which is defined through an interval in every dimension. Let $H_S$ and $H_N$ be the skin histogram and the non-skin histogram respectively. Then the probability distributions are approximated by

$p(c \mid S) = \frac{H_S(c)}{|H_S|}, \qquad p(c \mid N) = \frac{H_N(c)}{|H_N|}, \qquad (4.10)$

where $|H| = \sum_i H_i$ denotes the total number of pixels in a histogram. To get a coarse idea of the appearance of the skin and non-skin color histograms, see Figure 4.1. Given the maximal missed skin error $\varepsilon$, the false detection error is minimized using Algorithm 1.
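As a sketch of equations 4.9 and 4.10, a $32^3$ RGB histogram can be kept as a sparse map from bin indices to counts; each 8-bit channel is divided by 8 to get its bin. This is an illustration of the model, not the thesis implementation.

```python
BINS = 32
STEP = 256 // BINS  # 8 intensity levels per bin

def bin_index(r, g, b):
    """Map an RGB color (0-255 per channel) to its 32x32x32 bin."""
    return (r // STEP, g // STEP, b // STEP)

def histogram(colors):
    """Count pixels per bin; the values H_i of equation 4.9."""
    h = {}
    for c in colors:
        key = bin_index(*c)
        h[key] = h.get(key, 0) + 1
    return h

def probability(h, color):
    """Approximate p(c|class) = H(c) / |H|  (equation 4.10)."""
    total = sum(h.values())
    return h.get(bin_index(*color), 0) / total
```

Note that two colors differing only within the 8-level bin width, e.g. (200, 150, 120) and (201, 151, 121), receive the same probability.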

Figure 4.1: $32^3$ RGB histograms of the skin (left) and non-skin (right) color obtained from the Compaq database. The histograms are plotted as 2D cuts through the blue channel bin B = 16.

The key value of the skin detection problem is the missed detection error $\varepsilon$, which is correlated with the false detection error. As the missed detection error is lowered, the false detection error increases, and vice versa. We have tested the classifier's performance on all our databases. An example of the results obtained from XM2VTS images can be seen in Figure 4.2.

Algorithm 1 Bisection of the color space into skin and non-skin classes

1. Sort the color bins $i$ in ascending order of the likelihood ratio $\frac{p(c \mid S)}{p(c \mid N)} \approx \frac{H_S(c)}{H_N(c)}$.

2. The first $n$ RGB bins, those with the lowest ratio, vote for non-skin; $n$ is chosen such that the sum of the skin pixel counts over these bins is maximal but less than or equal to $\varepsilon \cdot |H_S|$.

3. The remaining bins with a non-zero pixel count in $H_S$ vote for skin.

4. Bins with both $p(c \mid S)$ and $p(c \mid N)$ zero, i.e. those not encountered in the training images, vote for non-skin.

Since the conditions in the XM2VTS images are strongly restricted, the results are quite fair. On our WWW database, the trade-off between the false positive and false negative errors is stronger, since the color overlap in the images is higher (see Figure 4.3).


Figure 4.2: An example of skin detection applied to images from the XM2VTS database. The left column shows the original images, the second column the masks produced by setting $\varepsilon = 12\,\%$, and the right column $\varepsilon = 30\,\%$.

Figure 4.3: An example of skin detection applied to images from our WWW database. The left column shows the original images, the second column the masks produced by setting $\varepsilon = 1\,\%$, and the right column $\varepsilon = 12\,\%$.

Figure 4.4: The ROC curves obtained from the Compaq database (100 skin and 100 non-skin images) and the XM2VTS database (20 images). The pointers show the minimal errors for both ROCs, obtained by summing the missed skin and false skin errors: Fn + Fp = 4.5 % for the M2VTS data and 26.7 % for the Compaq data.

The correlation of the two errors can be expressed by the receiver operating characteristic (ROC) curve. The curve plots the false positive error against the false negative error as the threshold $\theta$ varies. The performance of the classifier can be quantified by the area under the ROC curve: the better the classifier performs, the smaller the area under the curve, and vice versa.

We have computed two ROC curves using the Compaq and XM2VTS databases. The computation of the error for a given $\varepsilon$ differed between the two databases. The Compaq database contains non-skin pictures, and these were used to obtain the false skin detection error; the images containing skin were used to count the missed skin error. Non-skin pixels in these images were discarded, since the masks marking the skin areas do not include all the skin pixels. A subset of 100 pictures containing skin and 100 pictures without skin was used. Because there are no images without skin in the XM2VTS database, we had to use the same images for both the false negative and the false positive error computation. To eliminate the error, we very carefully painted all the skin areas, including the lips, to mark all the skin pixels. Then all pixels marked as skin but classified as non-skin increase the false negative error, while non-marked pixels classified as skin contribute to the false positive error. The comparison of the ROC curves is in Figure 4.4.
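The per-threshold error computation can be sketched as below. The input format, one triple of $(p(c \mid S), p(c \mid N), \text{ground-truth label})$ per labeled pixel, is an assumption made for the sketch; sweeping the threshold over a range of values traces out the ROC curve.

```python
def roc_point(samples, theta):
    """False negative / false positive rates of the likelihood-ratio
    classifier (skin iff p(c|N)/p(c|S) < theta) at one threshold."""
    fn = fp = n_skin = n_nonskin = 0
    for p_s, p_n, is_skin in samples:
        ratio = p_n / p_s if p_s > 0 else float("inf")
        predicted_skin = ratio < theta
        if is_skin:
            n_skin += 1
            fn += not predicted_skin   # skin pixel missed
        else:
            n_nonskin += 1
            fp += predicted_skin       # non-skin pixel accepted
    return fn / n_skin, fp / n_nonskin

def roc_curve(samples, thetas):
    """One (Fn, Fp) point per threshold value."""
    return [roc_point(samples, t) for t in thetas]
```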


Chapter 5

A Multi-Color Model for the Face

To move forward from skin to face detection, we considered the face as a multi-colored object with a specific color model. The aim was to build a statistical color model for the face, similarly as we did for skin, including the lips and hair color. Then we would build a classifier which divides the pixels in an image into three classes: skin, lips and hair. First we briefly describe the statistical model and the proposed classifier, and then we discuss the results.

5.1 Face color model

We considered two statistical approaches to face color detection. The first one is to model the skin, lips and hair color individually and to build a classifier using these three models together with the model of the other colors. Since the non-skin color model was built from images containing no humans, it can be considered a non-face color model. Because the prior probabilities of skin, lips and hair are unknown, we cannot solve this task using the Bayes rule. In addition, we cannot use the Neyman-Pearson solution, since it works only for two classes, and extending it to more classes is not a trivial task. Hence we came up with a simple approach which couples the Neyman-Pearson task and the Bayesian rule.

We divide the task of face color detection into two subtasks. First we detect all face colors, regardless of the classification within the face. Let $\Omega_1 = \{F, N\}$ be the set of two classes, face and non-face colors, and $\Omega_2 = \{S, L, H\}$ the set of face components: skin, lips and hair. Provided we have both the $p(c \mid F)$ and $p(c \mid N)$ color models, we can use the Neyman-Pearson strategy described in section 4.2.

To build the face color model, we need the skin, lips and hair color models defined by the conditional probabilities $p(c \mid S)$, $p(c \mid L)$, $p(c \mid H)$ and the prior¹ probabilities $p_F(S)$, $p_F(L)$, $p_F(H)$, which denote the fractions of skin, lips and hair within the face. If we assume that the appearance of skin, lips and hair within a face is a random event, we can learn both the prior and the conditional probabilities and assemble the face color model

$p(c \mid F) = \sum_{\omega_i \in \Omega_2} p(c \mid \omega_i)\, p_F(\omega_i). \qquad (5.1)$

The second subtask is to divide the obtained face pixels into the three face components. Since we know the conditionals $p(c \mid \omega_i)$ and the priors $p_F(\omega_i)$, this can easily be done using the Bayes rule from 4.2:

$\hat\omega_i = \arg\max_{\omega_i \in \Omega_2} p(c \mid \omega_i)\, p_F(\omega_i). \qquad (5.2)$

5.2 Experiments

We have built three histograms for the skin, lips and hair color using the XM2VTS database. Within the same training dataset, we have estimated the prior probabilities $p_F(S)$, $p_F(L)$, $p_F(H)$. Using 5.1, we build the face color model $p(c \mid F)$, and together with the non-skin color model $p(c \mid N)$ the Neyman-Pearson solution is applied. Then we divide the face pixels into the three classes employing 5.2. Our first experiment was focused on the XM2VTS database. Despite relatively sufficient results (see Figure 5.1), there are several problems, which we discuss now.

Figure 5.1: An example of face features detection based on the face color model, applied to the XM2VTS database. The reddish areas show the skin class, the bluish areas the lips class, and the green areas the hair class.

The first one arises from the face color model $p(c \mid F)$. We have extended the skin-pixel detection to face-pixel detection using this model. The skin-pixel detection is built upon the condition that the skin color occupies only a small area in the color space. If we add the lips and especially the hair color, the extent is increased, and so is the overlap with other colors. Since the extent is increased, the probability distribution is "flattened", and this leads to an increase in the false negative error (see Figure 5.2).

¹ They are actually conditional too, but we can call them prior within the face color distribution.

Figure 5.2: The increased extent of the color model leads to a higher missed detection error.

To force the classifier to extract the face pixels, we can set a low threshold $\theta$, which reduces the false negative error. On the other hand, this results in an increase of the false positive error (many pixels are wrongly marked as hair) because of the extent of the model (see Figure 5.3). In addition, the hair color itself, in contrast to the skin color, has a big overlap with other colors.

Figure 5.3: The face color model produces a big amount of false hair detections.

Another problem is the overlap between the colors within the face. Since the lips have a small prior probability $p_F(L)$ and a big overlap with the skin color, the lips detection is poor, especially at low resolutions, where the difference between the skin and lips color is negligible. The overlap is apparent especially in the WWW images, due to the variety of lighting and camera conditions (see Figure 5.4).

Figure 5.4: The big overlap between the skin and lip colors disadvantages the lips class.

We could accept the simple face color model's performance on restricted images like the XM2VTS database pictures. However, when the method is applied to WWW images, the results are not satisfactory and would probably be a bad starting point for higher-level face analysis. Thus we have left the pixel-level image analysis at the skin/non-skin view.


Chapter 6

Connected Components

Considering the performance of our skin pixel classifier, we realize that the detection performance established on a single pixel's color (per-pixel detection) is strictly limited by the overlap of the skin and non-skin color models. Considering the face as a connected component can help both the skin pixel classification and the face detection. In this chapter, first the reasons and basic presumptions leading to the connected component analysis are explained, then the proposed region growing method is presented, and finally some results are shown.

6.1 Single face color distribution

Our model of the skin color distribution covers different faces and different environmental conditions like illumination color, camera etc. But in a single image, the color distribution of a single face is much narrower than the whole skin color model. In other words, the color of skin can vary a lot between different people and pictures (see Figure 6.1), but it is rather stable within a single face and image. A change of the illumination color within a face can be a problem; however, in most images there is only a single light source and its color is usually close to white.

The second assumption is that the face pixels are clustered in a particular place in the image, so we can consider a region consisting of connected skin pixels as a face candidate. To ensure that the majority of the face pixels are detected, we need to set a low false negative error in the Neyman-Pearson rule. The trouble is that by doing so, we include surrounding pixels from the background, clothing etc. which have skin color into the region, which disallows any simple verification of the region being a face or not (see Figure 6.2).

Figure 6.1: Examples of skin taken from different people and environment conditions.

We would like to divide the connected component according to the color and spatial distribution to obtain the individual scene objects, which we could then analyze further. This task is known as segmentation, and although many approaches have been proposed, no robust solution is available up to now. However, the advantage of our task is the restriction to the skin regions only, and thus the number of segmentation options is much smaller than in a full image.

Figure 6.2: Examples of connected components. The left image shows a region which includes hair; the right image is a corrupted face region with no occlusions.

6.2 Region Growing

To isolate homogeneous regions, we have proposed a method based on region growing. The objective of the method is to build a color model of the region and, following this model, to allocate the region's spatial distribution. At the beginning, the region is initialized in a homogeneous place in the image to ensure a start inside the sought object. The initial color model is built from the initial region, which includes only a few pixels. In the growing phase, the region is updated by appending a part of the boundary (e.g. one pixel). For every added part of the boundary, a loss can be estimated. The loss can be used both to select the right piece of the boundary to add and to estimate the region's limits. First we introduce the general region growing concepts, and then the particular application to our task is described.

- $R = R(t)$ denotes the growing region at time $t$, defined through a set of pixels $x \in R$. The time $t$ is equivalent to the size of the region.
- $B(R)$ is the region's boundary. We can consider the boundary in two ways: either as the maximal set of pixels $x \notin R$ whose pixel neighborhood $N(x)$ satisfies $N(x) \cap R \ne \emptyset$, which we denote $B_{out}$, or as the set of lines between pixels from $R$ and $\neg R$, which we call $B_{border}$.
- $\Delta R$ means the increase of the region in one optimizing step, i.e. $R(t+1) = R(t) \cup \Delta R$.
- $L(R)$ is the loss function defined for the region. To denote the loss function at time $t$, we simply write $L(t)$.

There are several features we can use to estimate the loss function, which depends on the region and implicitly on the region's boundary, since the boundary is a function of the region:

- Internal feature homogeneity can be any function $f(\sigma, \mu, \max(R), \min(R))$ of a feature such as color, texture etc.
- Contrast at the boundary, using the average or the variance of the contrast with respect to the pixels neighboring the boundary, or with respect to the average color $\mu$.
- Spatial homogeneity might be measured by the normalized boundary size $|R| / |B|^2$, or e.g. by the fit to an a priori known contour (e.g. an ellipse).
- Other knowledge about the region, e.g. the probability that a pixel belongs to some particular class (skin).

The part of the boundary being added could be a pixel, the median of a pixel neighborhood, or a small block of pixels. We have tested only the pixel and the median, and thus in the text below we speak only about pixels.

The segmentation is proposed as an optimization process minimizing the change of the loss function, $L(t+1) - L(t)$. The loss function should be designed such that it can be evaluated in constant time $O(1)$. That means that if the loss function consists of several parts, this condition must be satisfied by each of them. However, the speed of the method is still strongly dependent on the image size, since we select the added pixel from all the pixels in the boundary. To reduce the computational time, we choose a random subset of the boundary pixels, and only this subset is used for the pixel selection.

6.3 Experiments

Our region growing solution is pixel-oriented only, and the color is used as the observed feature. The pixels are selected from the boundary $B_{out}$ using a loss function of the form

$L(R) = \sum_i w_i\, \ell_i, \qquad (6.1)$

where the sum runs over the different features of the region, $w_i$ is the weight of the $i$-th feature and $\ell_i$ is the loss of the region with respect to the $i$-th feature.

We have used four features:

- $\ell_{var}$ is the color variability of the region. We discuss two ways of modeling this variability in section 6.3.1.
- $\ell_{contrast}$ is the average color contrast at the boundary. We have tried both the average difference from the average color of the region and the average difference between the pixels neighboring $B_{border}$. Results from these two approaches are shown in section 6.3.2.
- The compactness $c$ of the region is simply defined as the normalized boundary size $c = |R| / |B|^2$, and the corresponding loss is $\ell_c = 1/c$. As the boundary, both $B_{out}$ and $B_{border}$ were used. See section 6.3.3 for the difference.
- $\ell_{nonskin}$ is a loss incurred when the region contains a pixel classified as non-skin at a low false negative error. We have restricted the region growing to the skin pixels only, which results in

  $\ell_{nonskin}(c) = \begin{cases} \infty & \text{if } \frac{p(c \mid S)}{p(c \mid N)} \le \theta_{skin}, \\ 0 & \text{otherwise.} \end{cases}$

These four features are combined using the weights to obtain a loss function that directs the region growing within reasonable bounds. The basic steps of the region growing algorithm are outlined in Algorithm 2, which takes an image,


Algorithm 2 Region growing algorithm

1. Initialize the region $R(0)$ at the starting location. Compute the loss functions $\ell_i(0)$. Set $t = 0$.

2. Evaluate the region's boundary $B_{out}$. If the boundary is empty, return.

3. Take $k$ pixels $x_j$ from the boundary $B_{out}(R(t))$, and for each region $R(t) \cup x_j$ compute the loss functions $\ell_i(t+1)$ and subsequently the loss $L_j$ using equation 6.1.

4. Add the pixel with the smallest loss, $\hat{x} = \arg\min_j L_j$, to the region: $R(t+1) = R(t) \cup \hat{x}$.

5. Save the relevant loss functions $\ell_i(t+1)$, set $t = t + 1$. Go to step 2.

the mask marking the pixels classified as skin, and a starting location, and produces the growing function $R(t)$ and the loss functions $\ell_i(t)$.

First we show the practical effects brought by each of the individual features used in the loss function, and then the detection of the actual region is discussed. In all the experiments below, the results are shown on full images without previous skin detection, to test the performance in a complex environment.

6.3.1 Color variability of the region

The color variability feature makes the region tend to be homogeneous in color; thus, using only this feature, the region spreads itself along similar colors (see Figure 6.3).

Figure 6.3: An example of the region growing result using the variability feature only.

We consider the color variance of the region the most important sign when enlarging the region. To simplify the computation of the variance, we presume the


covariance matrix of the region to be diagonal. To ensure that, the (c_r, c_g, I) color space is used, where

c_r = R / (R + G + B + 1)
c_g = G / (R + G + B + 1)    (6.2)
I = log(R + G + B + 1)

R, G and B are the color intensities in the RGB color space. We denote c_r as the red chroma, c_g as the green chroma and I as the intensity. The covariance matrix now consists of three numbers only, which we denote σ²_cr, σ²_cg and σ²_I. Our aim is to keep the covariance matrix as stable as possible while adding pixels to the region. Since we do not know the actual covariance of the region, we seek to keep the variance low. To achieve that, we proposed two approaches to minimizing the color variance. The first one minimizes the covariance matrix determinant

|Σ| = σ²_cr · σ²_cg · σ²_I.    (6.3)

To simplify this operation, we can minimize log |Σ| and sum instead of multiplying. The second approach is to maximize p(x), where x is the color of the added pixel in the (c_r, c_g, I) color space. Instead of p(x), we can use the Mahalanobis distance, which is

d = Σ_i (x_i − μ_i)² / σ²_i,    i = 1, 2, 3    (6.4)

where the set of μ_i and σ_i represents the color distribution of the region.
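As an illustration, the chroma transform of equation 6.2 and the two variance criteria (equations 6.3 and 6.4) can be written in a few lines of NumPy; the function names are ours:

```python
import numpy as np

def to_crcgi(rgb):
    """Transform RGB pixels (N x 3) into the (c_r, c_g, I) space of eq. 6.2."""
    rgb = np.asarray(rgb, dtype=float)
    s = rgb.sum(axis=1) + 1.0          # R + G + B + 1 avoids division by zero
    cr = rgb[:, 0] / s                 # red chroma
    cg = rgb[:, 1] / s                 # green chroma
    i = np.log(s)                      # log intensity
    return np.stack([cr, cg, i], axis=1)

def log_det_cov(colors):
    """log of the diagonal-covariance determinant (eq. 6.3) -- a sum of logs."""
    var = colors.var(axis=0) + 1e-12   # small floor keeps the log finite
    return float(np.sum(np.log(var)))

def mahalanobis(colors, x):
    """Squared Mahalanobis distance of pixel color x to the region (eq. 6.4)."""
    mu = colors.mean(axis=0)
    var = colors.var(axis=0) + 1e-12
    return float(np.sum((x - mu) ** 2 / var))
```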

[Plots: "RGB color histogram for skin" (left) and "CrCgI color histogram" (right) — number of pixels vs. normalized color intensity; legends: Red, Green, Blue / Red chroma, Green chroma, Intensity]

Figure 6.4: An example of face skin color distribution. The left plot shows the histograms of skin color in the RGB color space for each individual channel; the right one shows the histograms after the transformation into the (c_r, c_g, I) space. The image selection shows the pixels included in the histograms.

At first sight, we considered both approaches as leading to the same result, minimizing the region variance. However, since the second approach uses


the Mahalanobis distance, it assumes the distribution of the face skin color to have a Gaussian shape. In some images this is an adequate approximation (see Figure 6.4), but if the illumination changes within the face, the distribution is correlated with that change. This is shown by the plots in Figure 6.5, which depict the color distributions of skin in both color spaces. In the RGB space the intensity influences all the color dimensions, while in (c_r, c_g, I) only the intensity and the red chroma are influenced.

[Plots: "RGB color histogram for skin" (left) and "CrCgI color histogram for skin" (right) — number of pixels vs. normalized color intensity; legend: Red chroma, Green chroma, Intensity]

Figure 6.5: An example of face skin color distribution influenced by an illumination change within the face.

Since the Mahalanobis distance solution enforces the Gaussian shape of the color distribution too strongly, we have used the first approach, which minimizes the determinant from equation 6.3, because it only requires a minimal change of the variance in every dimension. Compare the results obtained from the two approaches in Figure 6.6.

Figure 6.6: The first row shows the region growing results obtained by minimizing the variance (see equation 6.3), while the second row shows the results obtained by minimizing the Mahalanobis distance (see equation 6.4).


6.3.2 Difference at the boundary

The difference at the boundary is used to signal the edges of the actual region, and we have proposed two basic forms to model this feature. The first one measures the average color difference between neighboring pixels across the boundary. Formally, we can write

d_diff,1 = ( Σ_{x_i ∈ crust} Σ_{x_j ∈ bound} neigh(x_i, x_j) ‖c(x_i) − c(x_j)‖ ) / |crust|    (6.5)

where x_i and x_j denote the spatial coordinates of pixels and the function neigh(x_i, x_j) equals one if the pixels are neighbors and zero otherwise. This function is independent of the color distribution of the region, so even regions with multiple colors can be detected. The second function uses the mean color of the region, and all the boundary pixels are compared to that mean:

d_diff,2 = ( Σ_{x_i ∈ bound} ‖c(x_i) − μ_R‖ ) / |bound|    (6.6)

where μ_R denotes the mean color vector of the region. Since this function depends on the mean color of the region, it can be biased if the region has a non-homogeneous color distribution. However, we seek only regions with homogeneous color, and unlike with the first function, even fuzzy edges can be detected if the color distributions of the two neighboring regions are distant enough. The question now arises whether the difference at the boundary should be minimized or maximized. Our observation was that by minimizing the difference at the boundary, the region stays inside the edges as long as possible. If we tried to maximize the difference, the region tended to stick to the boundary from either the inside or the outside (see Figure 6.7). However, when we added other features (such as the variation), the difference between maximizing and minimizing was not noticeable.

Figure 6.7: An example of the region growing results using the difference at the boundary feature only. The first row shows results obtained by maximizing the difference, the second by minimizing it.


We did not examine this feature further and we did not use it for region growing,which is outlined in section 6.3.4.
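For reference, both boundary-difference measures (equations 6.5 and 6.6) follow directly from their definitions; here `region` is a boolean mask over the image, and the helper names are ours:

```python
import numpy as np

def boundary_pairs(region):
    """Neighboring (inside, outside) pixel pairs across the region boundary."""
    h, w = region.shape
    pairs = []
    for r in range(h):
        for c in range(w):
            if not region[r, c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and not region[rr, cc]:
                    pairs.append(((r, c), (rr, cc)))
    return pairs

def diff_pairwise(image, region):
    """Eq. 6.5: mean color difference between neighbors across the boundary."""
    pairs = boundary_pairs(region)
    crust = {outside for _, outside in pairs}
    total = sum(np.linalg.norm(image[a] - image[b]) for a, b in pairs)
    return total / max(len(crust), 1)

def diff_to_mean(image, region):
    """Eq. 6.6: mean deviation of inner-boundary colors from the region mean."""
    pairs = boundary_pairs(region)
    bound = {inside for inside, _ in pairs}
    mu = image[region].mean(axis=0)
    total = sum(np.linalg.norm(image[p] - mu) for p in bound)
    return total / max(len(bound), 1)
```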

6.3.3 Compactness of the region

Since the aim of our loss function is to segment all kinds of image regions, we have only little a priori knowledge about the compactness of the region. Thus we just use the normalized size of the boundary, which prevents the region from being scattered over similar colors as in the second image in Figure 6.3. Using |bound| and |crust| brings similar results; both tend to keep a square-like region (see Figure 6.8). We have chosen to minimize |crust|, since the penalty for pixels not included in the region, although surrounded by region pixels, is much higher. This results in better modeling of the actual color distribution of the region.

Figure 6.8: Two examples of region growing results. The left image shows the result of minimizing |crust|; the second image shows the result of minimizing |bound|.

6.3.4 Homogeneous regions detection

In the first paragraph of section 6.2, the introduction to region growing, we characterized the loss function as an indicator which can be used both in the region growing process itself and to determine the region's actual boundary. But during the experiments, we did not find any significant points in the loss function which could signal that the region was the one we were looking for. For the individual components of the loss function, these signs ought to have the following properties:

• The color variability of the region is supposed to be small, and thus we assume that when the region exceeds its actual boundaries, there is a sharp increase in the variability. We could detect a high first or second derivative of the variability L_v(t) to select the region, or simply prefer a low variability.

• The difference at the boundary should be high for both functions d_diff,1 and d_diff,2, defined in equations 6.5 and 6.6 respectively. If the region exceeds


the boundaries, the difference ought to drop again. This assumption is the same for both functions, though for different reasons. For d_diff,1, we suppose that the color difference inside the neighboring region is lower than the difference at the boundary; in other words, the regions in the image are relatively homogeneous in color in comparison with the edges. For d_diff,2, the decrease arises because, by adding pixels from the neighboring region, the color mean starts to shift away from the region's color. Using these assumptions, we can accept the local maxima as the signs signaling the actual regions.

• For the compactness, we can only assume a high value, since the regions ought to be relatively homogeneous in their spatial extent. Our observation is that the region's compactness increases while the actual region is being filled and drops again after the region flows out, so we can localize the maxima here as well.

[Plot: "Minimizing the region color dispersion" — loss function vs. number of pixels; curves: boundary difference, region color variance, compactness loss]

Figure 6.9: Loss functions obtained using only the region variance as the criterion for growing. The actual regions are depicted in the images on the left, which show the region stages at 7000 and 11000 pixels.

We can see that these requirements are consistent with the demands on the loss function used for region growing, except for the difference at the boundary, where this is not so obvious. Hence it seems reasonable to apply the same loss function for region detection. But during the experiments we found out that the more we restrict the growing process with a loss component L_i, the less noticeable is the sign for the given component described above. The restriction is made by setting the weight w_i in equation 6.1; a strong restriction means a high weight. For example, using the color variance as the only feature for growing produces a variance loss function which rises slowly with only imperceptible changes (see Figure 6.9). On the other hand, both the boundary difference and the compactness loss significantly point out the actual regions (the face, and the face with hair and


[Plot: "Minimizing boundary difference" — loss function vs. number of pixels; curves: boundary difference, region color variance, compactness loss]

Figure 6.10: The plot shows the loss functions obtained by minimizing the boundary difference. The image shows the region stage at 6000 pixels, just after it flowed out of the face.

clothing). When using the boundary difference as the only feature, there is a large increase in both the color variance and the compactness loss (see Figure 6.10). When a strong restriction on the compactness was made, two noticeable breaks in the variance loss function signaled the regions, but surprisingly, the difference at the boundary was a poor region detector (see Figure 6.11).

[Plot: "Maximizing the compactness" — loss function vs. number of pixels; curves: boundary difference, region color variance, compactness loss]

Figure 6.11: The plot shows the loss functions obtained by maximizing the compactness with a small restriction on the color variance. The images on the left show growing examples at 2000 and 7000 pixels.

From the observed behavior of the loss functions, we have come to the conclusion that there is only a weak dependence between the individual features. The consequence is that by choosing the pixels according to one feature's loss, the other loss functions take the average change within all the boundary pixels and can be used as actual boundary detectors. To obtain reasonable region growing results, we have mixed the individual loss functions with empirically set weights using equation 6.1. See Figure 6.12 for an example of the region growing.


[Plot: "Growing with wdiff1 = 0.1, wcomp = 0.2, wvar = 0.7" — loss function vs. number of pixels; curves: boundary difference, region color variance, compactness loss, resulting loss function]

Figure 6.12: Region growing results obtained using the loss function from equation 6.1. The region expanded over the face within about t = 1200 pixels. The region is shown at sizes from 9 to 2500 pixels with a step of 500 pixels.

From the resulting loss function we could hardly determine the face region, which lies approximately at 1200 pixels. Thus we have employed another function to find homogeneous regions, which we call the profit function, since the sought regions should occur at its maxima:

p = d_diff · (1/L_c) · (1/L_v)    (6.7)

where d_diff is the difference at the boundary, 1/L_c is the compactness and 1/L_v is the color homogeneity of the region. By choosing multiplication, we stress the relative rather than the absolute differences in the individual loss functions. The profit function can then be evaluated as the logarithm

p_L = log p = log d_diff − log L_c − log L_v.    (6.8)

For the same region growing example as in Figure 6.12, the profit function is shown in Figure 6.13.

[Plot: "Profit function" — profit function vs. number of pixels]

Figure 6.13: Profit function for the region growing depicted in Figure 6.12.
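The profit function of equations 6.7 and 6.8 reduces to a few lines; the component losses would be supplied by the growing process:

```python
import math

def profit(d_diff, loss_c, loss_v):
    """Profit function of eq. 6.7: p = d_diff * (1/L_c) * (1/L_v)."""
    return d_diff / (loss_c * loss_v)

def log_profit(d_diff, loss_c, loss_v):
    """Eq. 6.8: log p = log d_diff - log L_c - log L_v."""
    return math.log(d_diff) - math.log(loss_c) - math.log(loss_v)
```

Working in the log domain, as the thesis does, turns the products into sums and keeps the relative (rather than absolute) differences of the components in play.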


Unfortunately, the sought region R(t̂) is usually not found at the global maximum of the profit function, t̂ = argmax_t p(t). The profit function reaches a local maximum whenever the region has spread over a homogeneous area, and we cannot a priori define the parameters of that homogeneous region. Thus instead of one maximum, we find a set of significant local maxima M. This set defines a set of face candidates R(t), t ∈ M, for a single starting location. To extract the local maxima, we use a simple algorithm that takes the profit function p as its input and returns a set of local maxima positions M. There are three

Algorithm 3 Searching for the set of local maxima M in the profit function p

1. Take every k-th value from p and use them to approximate the profit function p with a spline function s.

2. From s, generate the two shifted functions s_+(t) = s(t + t_shift) + p_shift and s_-(t) = s(t − t_shift) + p_shift.

3. For every interval (t_1, t_2) from s such that s(t) > s_+(t) and s(t) > s_-(t) for all t in the interval, find the global maximum position t̂ and put it into M.

parameters in the algorithm: k defines the accuracy of the spline approximation, while t_shift together with p_shift determine the required sharpness and height of the peaks. All the parameters are assessed empirically from the length of the profit function. An example of the face candidate evaluation is shown in Figure 6.14.
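A rough Python analogue of Algorithm 3, with the spline replaced by a moving-average smoother (an implementation choice of ours, not the thesis's):

```python
import numpy as np

def local_maxima(profit, k=5, t_shift=3, p_shift=0.0):
    """Find significant local maxima in a sampled profit function.

    k       : smoothing step (spline accuracy in the thesis)
    t_shift : horizontal shift of the comparison curves (peak sharpness)
    p_shift : vertical shift (minimum peak height)
    """
    p = np.asarray(profit, dtype=float)
    kernel = np.ones(k) / k
    s = np.convolve(p, kernel, mode="same")   # smoothed stand-in for the spline
    n = len(s)
    qualifying = []
    for t in range(t_shift, n - t_shift):
        # t qualifies when s dominates both shifted copies (step 3)
        if s[t] > s[t - t_shift] + p_shift and s[t] > s[t + t_shift] + p_shift:
            qualifying.append(t)
    # keep the highest position from each contiguous run of qualifying indices
    maxima = []
    for t in qualifying:
        if maxima and t - maxima[-1] <= t_shift:
            if s[t] > s[maxima[-1]]:
                maxima[-1] = t
            continue
        maxima.append(t)
    return maxima
```

On a clean two-peaked signal, the function returns one position per peak; on a real profit function the shifts and the smoothing step would need empirical tuning, as the thesis notes.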

[Plot: "Profit function analysis" — profit function vs. number of pixels; curves: profit function, smoothed profit function, local maxima intervals]

Figure 6.14: Face candidates obtained from the profit function. The images show the original picture and all the regions at the local maxima. The plot shows the profit function together with the spline function and the local maxima intervals.


6.3.5 Starting points

To locate the starting points for region growing, we have used two assumptions:

• the region is homogeneous in color;

• a region is a skin region only if it contains at least one pixel with a high probability of being skin.

The first assumption leads to the simple Algorithm 4, which stemmed from [6, pages 96-97]; it takes a color image and produces a texture map in which high values mean high color contrast (texture) and vice versa. The texture mask stresses

Algorithm 4 Computation of the texture mask T from a color image I

1. Create a new image M of the same size as the image I.

2. For each color plane c in I and for each pixel position (x, y) in I, compute the mean value of the pixels in the neighborhood of (x, y) and insert the mean value into the image M at the location given by (x, y) and c.

3. Compute the image of absolute differences for each color plane separately: D = |I − M|.

4. Create the intensity image DI from D.

5. Create the texture image T of the same size as DI.

6. For each pixel position (x, y) in DI, compute the mean value of the pixels in the neighborhood of (x, y) and insert the mean value into the texture map T at the location given by (x, y).

the edges in the image, and thus we can avoid starting the region growing at any of the boundaries between the actual regions.
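Algorithm 4 maps onto a few array operations; the sketch below uses a plain box filter for the neighborhood means, and the function names are ours:

```python
import numpy as np

def box_mean(a, size=5):
    """Neighborhood mean via a box filter (edge-replicated padding)."""
    pad = size // 2
    padded = np.pad(a, pad, mode="edge")
    out = np.zeros(a.shape, dtype=float)
    for dr in range(size):
        for dc in range(size):
            out += padded[dr:dr + a.shape[0], dc:dc + a.shape[1]]
    return out / (size * size)

def texture_mask(image, size=5):
    """Texture map T from a color image (Algorithm 4). High values = edges."""
    image = np.asarray(image, dtype=float)
    m = np.stack([box_mean(image[..., c], size)          # steps 1-2: means M
                  for c in range(image.shape[-1])], axis=-1)
    d = np.abs(image - m)                                # step 3: D = |I - M|
    di = d.mean(axis=-1)                                 # step 4: intensity DI
    return box_mean(di, size)                            # steps 5-6: texture T
```

A flat image yields a zero mask; a step edge yields a band of high values around the edge, which is exactly what the starting-point selection avoids.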

The second assumption should reduce the number of regions containing only pixels with a small probability of being skin. For example, a very bright color can occur on a shiny brow, but not over the whole face. This assumption can help to accept such infrequent pixels in the face and reject the same pixels in other areas (e.g. a bright yellow wall). To exploit this assumption, we have proposed a simple connected component analysis algorithm. We classify all the pixels with two thresholds, θ_low and θ_high, into skin and non-skin classes. We call the pixels labeled as skin with θ_low the provisory skin pixels, while


the pixels classified with θ_high are the high probability skin pixels. Then we mark as skin all the connected components of provisory skin pixels that include any of the high probability pixels. We have evaluated a set of ROC curves for the same dataset as in section 4.4. Every ROC curve was computed for a high threshold θ_high and a set of θ_low < θ_high. The best result was obtained for ε(θ_high) = 0.6, but the benefit was only negligible (see Figure 6.15).
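The two-threshold scheme is essentially hysteresis thresholding; a sketch using a breadth-first flood fill over 4-connected components (names are ours):

```python
import numpy as np
from collections import deque

def hysteresis_skin(prob, theta_low, theta_high):
    """Keep provisory-skin components that contain a high-probability pixel.

    prob : per-pixel skin probability (or likelihood-ratio) map
    """
    provisory = prob >= theta_low
    seeds = prob >= theta_high
    h, w = prob.shape
    out = np.zeros((h, w), dtype=bool)
    visited = np.zeros((h, w), dtype=bool)
    for r, c in zip(*np.nonzero(seeds)):
        if visited[r, c]:
            continue
        # flood-fill the provisory component containing this seed
        queue = deque([(r, c)])
        visited[r, c] = True
        while queue:
            rr, cc = queue.popleft()
            out[rr, cc] = True
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = rr + dr, cc + dc
                if 0 <= nr < h and 0 <= nc < w and provisory[nr, nc] and not visited[nr, nc]:
                    visited[nr, nc] = True
                    queue.append((nr, nc))
    return out
```

Provisory components without a seed are dropped, which is how a bright yellow wall can be rejected while a shiny brow inside a face survives.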

[Plot: "ROC curves" — false positive error vs. false negative error; curves: simple skin detection, connected components for ε = 0.6]

Figure 6.15: ROC curves obtained using 200 images from the Compaq database. The simple skin detection curve is the same as the plot in the Neyman-Pearson classifier testing.

The starting point is now located using only pixels with a rate higher than θ_high, choosing the lowest value from the texture map (see Figure 6.16), and the region growing is limited to the provisory skin pixels only.

(a) (b) (c) (d)

Figure 6.16: Starting point evaluation. a) Original image, b) texture mask, c) all the provisory pixels, d) connected components containing high probability skin pixels, together with the starting location.


Chapter 7

Face detection

Provided we had extracted only skin regions and segmented these regions into homogeneous parts, each segment could be validated and accepted or rejected on the basis of the validation result. But in our case we have several face candidates for each starting point, which means the candidates can overlap each other. If a candidate is marked as a non-face region, it can still cover the actual face region, and thus rejecting the whole region would mean losing the face. This problem arises if the region growing starts outside the face and spreads over the face during the process. To solve this problem and reduce the number of false positive detections, we have proposed a simple algorithm based on several assumptions. First we introduce a few more concepts which formally describe the process of region growing applied to face detection:

• F is a projection of a human face into the picture coordinates. F is represented by a set of pixels in the spatial coordinates. The other parts of the image are denoted ¬F.

• R(t) is defined as in the previous chapter, but we consider only those R(t) which are significant for the face verification; this means that the number of pixels is generally higher.

• Flowing out of the face is a time t_out such that R(0), …, R(t_out − 1) ⊆ F and R(t_out), …, R(t_out + n) ⊄ F, where n is such that the neighboring part R(t_out + n) ∩ ¬F disallows the face validation or other further processing.

• Running into the face is a time t_in such that R(0), …, R(t_in − 1) ⊆ ¬F and R(t_in), …, R(t_in + n) ⊄ ¬F, where n is such that the included part of the face R(t_in + n) ∩ F disallows the face validation or other further processing.


To process all the connected components from the image using the homogeneous regions from the region growing process, we have made these assumptions:

• If the starting position of the region growing process is located in the face, then the region first spreads over the face before it flows out. Formally, if R(0) ⊆ F, then R(t_out − 1) ⊇ F.

• A region which has spread over the face can be detected by a local maximum of the profit function. Formally, if R(t_out − 1) ⊇ F, then t_out − 1 ∈ M.

• If the growing process starts outside the face, then the moment right before running into the face can be detected by a local maximum in the profit function. Formally, if R(0) ⊆ ¬F and t_in exists, then t_in − 1 ∈ M.

To validate the face candidates, we have used a rejection rather than an acceptance scheme, since the number of variations of the face appearance caused by different rotations and scales is high. We reject a face candidate if the given region fails to satisfy any of these assumptions:

• the face is approximately elliptic, with a limited ratio of the lengths of its axes;

• the face contains small darker areas, such as the eyes and lips, which appear as local minima in the intensity image;

• the number of these dark spots does not exceed a certain threshold.

To apply the first assumption, we approximate the region with an ellipse E using the moments of inertia. Then we assess how well the region R is approximated by the ellipse. For that purpose, the following measure is evaluated:

ω = ( Σ_{x ∉ E} i(x, R) + Σ_{x ∈ E} (1 − i(x, R)) ) / Σ_{x ∈ E} 1    (7.1)

where the function i(x, R) indicates that the pixel x belongs to the region R:

i(x, R) = 1 if x ∈ R, 0 otherwise.

The fit 1/ω of the ellipse is then limited by a threshold, as is the ratio of the axes' lengths a_major / a_minor. For the second and third assumptions, the number of significant local minima in the intensity image is evaluated. The face is rejected if this number is zero or higher than a certain threshold.
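The ellipse approximation from the moments of inertia and the measure of equation 7.1 can be sketched as follows; the semi-axis convention (a_i = 2·sqrt(λ_i)) and the helper names are our assumptions:

```python
import numpy as np

def ellipse_fit(region):
    """Fit of a region against its inertia-equivalent ellipse.

    Returns (omega, axes_ratio): omega is the mismatch of eq. 7.1
    (region pixels outside the ellipse plus ellipse pixels outside
    the region, normalized by the ellipse area); axes_ratio is
    a_major / a_minor.
    """
    ys, xs = np.nonzero(region)
    my, mx = ys.mean(), xs.mean()
    # central second moments (moments of inertia) of the pixel set
    cov = np.cov(np.stack([ys - my, xs - mx]))
    evals, evecs = np.linalg.eigh(cov)          # ascending eigenvalues
    a_minor, a_major = 2.0 * np.sqrt(np.maximum(evals, 1e-12))
    axes_ratio = a_major / a_minor
    # membership test for region pixels, in the ellipse eigenbasis
    pts = evecs.T @ np.stack([ys - my, xs - mx])
    inside = (pts[0] / a_minor) ** 2 + (pts[1] / a_major) ** 2 <= 1.0
    # rasterize the ellipse over the whole grid to count its pixels
    h, w = region.shape
    gy, gx = np.mgrid[0:h, 0:w]
    q = evecs.T @ np.stack([(gy - my).ravel(), (gx - mx).ravel()])
    ell = ((q[0] / a_minor) ** 2 + (q[1] / a_major) ** 2 <= 1.0).reshape(h, w)
    omega = (np.sum(~inside) + np.sum(ell & ~region)) / max(ell.sum(), 1)
    return float(omega), float(axes_ratio)
```

For a filled disk the mismatch is small and the axes ratio is close to one, so an elliptic-face test would accept it.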


Using the assumptions about the region growing result and about the face candidates, we can build an algorithm which gradually analyzes the face candidates. If all the candidates for one starting location are rejected, only the first candidate is cut out from the connected component, to reduce the input space.

The input of Algorithm 5 is the image and two masks: mask(θ_high), marking the pixels with a high probability of being skin, and mask(θ_low), marking the provisory skin pixels. The output is a set S of disjoint regions which represent the face candidates in the image.

Algorithm 5 Face finding

1. If the mask mask(θ_high) is empty, end.

2. Using the mask mask(θ_high) and the texture mask, find the starting location R(0) for the region growing.

3. Apply the region growing algorithm, which produces the growing function R(t) and the loss functions L_v(t), L_diff(t) and L_c(t).

4. Compute the profit function p(t) using equation 6.7. Find the ascending ordered set M of the local maxima positions in p(t).

5. For all the face candidates from M, compute the fit 1/ω, the axes ratio and the number of dark holes. Using the thresholds, pick all the regions that satisfy the criteria for a face and put them into a new set M'.

6. If the set M' is empty, take the first entry t_1 ∈ M, remove the corresponding region R(t_1) from both masks mask(θ_high) and mask(θ_low), and go to step 1.

7. Find t̂ ∈ M' such that the fit 1/ω of the corresponding region to the ellipse is the greatest, and put the region into the final set S. Remove the region from both masks mask(θ_high) and mask(θ_low) and go to step 1.

7.1 Experiments

To evaluate the face detector performance, we have used 40 pictures from the Compaq database, containing together 49 human faces in different positions, rotations and scales. In most images, if the connected component analysis described in section 6.3.5 is applied, the face lies in a component with other objects having skin color, such as hair, clothing, furniture or walls. We have applied two algorithms. The first one is described in the previous section and finds all the faces in the image. This algorithm correctly found 32 faces and incorrectly classified 37 regions as faces. The same test was performed for 20 images from the XM2VTS database: all 20 faces were correctly detected, while 3 false detections were made.

We have observed that the starting location of the region growing process is very often placed inside the face region. This is caused by the high probability of the pixels inside the face being skin and by the high color homogeneity of the face, especially the brow. We have used this observation to design a modified algorithm based on the algorithm from the previous section. In this algorithm only the first face is found, i.e., the algorithm ends after step 7. Using the same images from the Compaq database, 26 faces were correctly detected. In the same test applied to the 20 images from the XM2VTS database, all the faces were correctly detected.

Since the face model is very simple, we consider the results of the face detection promising. Usually a wrong classification stems from the acceptance of a non-face region. There are three cases of false acceptance which lead to a false detection or a false rejection.

• A small part of the face (e.g. the brow) is accepted; the result is a division of the face into several parts, which are detected as individual faces. These small areas usually have a very high fit to the ellipse.

• A region that flows out of the face is accepted and has an even better fit to the ellipse than the actual face (see Figure 7.1).

• A region that runs into the face is accepted as a face. If the greater part of the face is involved (and subsequently cut out), the face is lost (see Figure 7.2).

Figure 7.1: False detection example. Images from left to right: face detection result, rejected candidate (lower ellipse fit), accepted candidate (higher ellipse fit).


Figure 7.2: False detection caused by accepting a non-face region. Images from left to right: face detection result, rejected candidate with zero dark holes, accepted candidate with a non-zero number of holes.

Since the set of face candidates usually includes the actual face, a more sophisticated method for the face validation may be examined to reduce the number of wrong detections. Despite the poor model, we can still present a number of relatively good detection results in pictures with a high level of occlusion by other objects (see Figure 7.3).

Figure 7.3: Results from the face detection algorithm


Chapter 8

Discussion and conclusions

We have presented a method for human face detection in a complex scene environment, based on skin and non-skin color modeled by histograms, a segmentation method using a region growing technique, and a simple verification scheme for the obtained face candidates.

Since the skin detection is low-level and relies entirely on local information, the results will never be completely reliable, but they should be considered as providing useful information for a subsequent higher-level process based on the spatial distribution. Since we use histograms to model the skin and non-skin color, the performance of the per-pixel skin classifier cannot be increased by any color space transformation. This is possible only for other models (like a Gaussian mixture), which need to cluster the skin and non-skin colors into continuous blobs. Provided we have a large training dataset (38.7 million skin pixels in 33 thousand bins), the histogram approximation of the distribution is quite accurate.

A problem arises with different races, since their appearance on the WWW is not a random sample. We have observed poor skin classification for Africans with very dark skin, which is caused by their significantly lower presence in web images. This drawback, described in section 4.3, could be fixed by carefully selecting the training dataset with regard to the different human races, or by a color space transformation which unifies the human skin color across races.

We have tried to increase the skin classifier performance using the assumption that most skin regions include pixels with a high probability of being skin, although the rest of the pixels may have only a small chance of being labeled as skin. We classify the skin pixels with a high tolerance and then reject the connected


components which do not include any high probability skin pixels. We have computed a set of ROC curves and found a false positive error of 60 % to be the best for the high probability skin pixels. However, the improvement compared to the per-pixel detection was only negligible.

The multi-color model of the face described in chapter 5 cannot be applied in the low-level classification process, as we have found out, but it might be used as a tool in the verification scheme. However, the color of the lips depends at least on the skin color; moreover, in low resolution images the lips often cover only a negligible area.

Since the trade-off between false detections and false rejections in the per-pixel classification has a fixed limitation given by the ROC curve, we have employed the spatial attributes of the pixels to propose a segmentation method that extracts the homogeneous regions within the pixels classified as skin, which can be passed to a high-level verification process. The segmentation method is based on the region growing technique, proposed as a randomized optimization process using a loss function computed from the given region's color and spatial attributes. Because we cannot a priori define the characteristics of a homogeneous region, the output of the region growing process with a single starting point is usually a set of regions, signaled by the local maxima of the profit function, which is computed in a similar manner as the loss function used for growing.

The proposed region growing method is designed to be quite robust, since there are no a priori defined limitations on the homogeneous region. Hence if the growing process starts inside the face region, the actual face is usually included in the face candidate set. However, we have observed a strong dependency on the intensity of the color. Especially if one half of the face is in shade (most light comes from the side), flowing out of the face occurs after only one half of the face is covered. Another problem is the detection of the local maxima in the profit function if the edges around the face are weak (e.g. hair color similar to the skin color, or no shading under the chin). To improve the performance, different color space transformations and additional loss functions could be examined. For a proper examination, an appropriate measure of the performance should be proposed.

To validate the face candidates, we have proposed a very simple scheme based on several assumptions about the fit of the face to an ellipse and the presence of some dark patches. Although the simple face model works in a number of cases, wrong detections often appear, produced by several drawbacks in the model. First, the appearance of the dark holes depends on the image resolution and on the intensity of the skin color. Since the appearance of a face strongly depends on its size, a normalization to a low resolution should be made. The second problem is the choice of the region which has the best fit to an ellipse,


which is not always the actual face. Since the actual face is usually quite well extracted and included in the candidate set, a more sophisticated method could be used for the validation.
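One standard way to obtain a best-fit ellipse for a candidate region — assumed here for illustration, not necessarily what the thesis' ellipse.m does — is to match the second-order central moments of the region's pixel set:

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Fit an ellipse to a binary mask given as a list of (x, y) pixel
// coordinates by matching second-order central moments.
// Returns the semi-axis lengths {a, b} with a >= b.
std::pair<double, double>
fitEllipseAxes(const std::vector<std::pair<int, int>>& pixels) {
    double n = static_cast<double>(pixels.size());
    double cx = 0, cy = 0;
    for (const auto& p : pixels) { cx += p.first; cy += p.second; }
    cx /= n; cy /= n;
    double mu20 = 0, mu02 = 0, mu11 = 0;    // central second moments
    for (const auto& p : pixels) {
        double dx = p.first - cx, dy = p.second - cy;
        mu20 += dx * dx; mu02 += dy * dy; mu11 += dx * dy;
    }
    mu20 /= n; mu02 /= n; mu11 /= n;
    // Eigenvalues of the covariance matrix; the factor 2*sqrt(lambda)
    // matches the moments of a filled ellipse with those semi-axes.
    double common = std::sqrt(4.0 * mu11 * mu11
                              + (mu20 - mu02) * (mu20 - mu02));
    double l1 = (mu20 + mu02 + common) / 2.0;
    double l2 = (mu20 + mu02 - common) / 2.0;
    return {2.0 * std::sqrt(l1), 2.0 * std::sqrt(l2)};
}
```

A goodness-of-fit score can then compare the region area with the area of the fitted ellipse, which is one plausible way to rank the candidates mentioned above.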

Since the region growing is a randomized process, the results often differ between subsequent calls, even though 10 % of the boundary pixels are evaluated, which should be a well representing fraction. One more drawback should be noted here, namely the speed of the detection process. Even though the loss functions are evaluated in constant time, O(1), the speed depends on the size of the boundary of the region. We evaluate only a fraction of the boundary and keep the compactness of the regions high to reduce the number of pixels evaluated. However, the whole process can be very time consuming. For example, the evaluation of an image of approximately 200 × 300 pixels, with about 50 % of the pixels marked as skin, takes about 3 minutes on a dual-processor Pentium III 750 MHz computer.
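The constant-time loss evaluation relies on keeping the region's color statistics as running sums, so admitting one pixel costs O(1) regardless of the region size. A minimal sketch using Welford's update (illustrative names and a grayscale value, not the thesis code):

```cpp
// Running mean and variance of the pixel values in a region,
// updated in O(1) per admitted pixel (Welford's algorithm).
struct RunningStats {
    long long n = 0;
    double mean = 0.0;
    double m2 = 0.0;          // sum of squared deviations from the mean

    void add(double x) {      // O(1): no pass over the region is needed
        ++n;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }
    double variance() const { return n > 1 ? m2 / n : 0.0; }
};
```

With such incremental statistics, the only per-step cost that grows with the region is the boundary scan, which is exactly what the 10 % sampling above reduces.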


Bibliography

[1] International Conference on Automatic Face & Gesture Recognition, Grenoble, France, 2000.

[2] R. Auckenthaler, J. Brand, J. Mason, C. Chibelushi, and F. Deravi. Lip signatures for automatic person recognition. In IEEE Workshop on Multimedia Signal Processing (MMSP), pages 457–462, 1999.

[3] Christopher M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, Great Britain, 3rd edition, 1997.

[4] Jaroslav Blazek. Detekce osob z barevne informace (Detection of persons from color information). Master's thesis, Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Control Engineering, Prague, Czech Republic, 1999.

[5] Gary R. Bradski. Computer vision face tracking for use in a perceptual user interface. Intel Technology Journal, 1998.

[6] David Forsyth and Jean Ponce. Computer Vision: A Modern Approach. http://woska/Forsyth CV Book/.

[7] M. Jones and J. Rehg. Compaq skin database. http://www.crl.research.digital.com/publications/techreports/abstracts/98 11.html.

[8] Michael J. Jones and James M. Rehg. Statistical color models with application to skin detection. Technical Report CRL 98/11, Compaq Cambridge Research Lab, 1998.

[9] Stephen J. McKenna, Sumer Jabri, Zoran Duric, and Harry Wechsler. Tracking interacting people. In International Conference on Automatic Face & Gesture Recognition [1], pages 348–353.

[10] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. XM2VTSDB: The extended M2VTS database. In International Conference on Audio- and Video-Based Biometric Person Authentication, pages 72–77, 1999.


[11] Rein-Lien Hsu, Mohamed Abdel-Mottaleb, and Anil K. Jain. Face detection in color images. http://citeseer.nj.nec.com/392009.html.

[12] Michail I. Schlesinger and Vaclav Hlavac. Deset prednasek z teorie statistickeho a strukturniho rozpoznavani (Ten lectures on the theory of statistical and structural pattern recognition). CVUT, Prague, Czech Republic, 1999.

[13] Karl Schwerdt and James L. Crowley. Robust face tracking using color. In International Conference on Automatic Face & Gesture Recognition [1], pages 90–95.

[14] K. Sobottka and I. Pitas. Segmentation and tracking of faces in color images. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pages 236–241, 1996.

[15] Moritz Storring, Hans J. Andersen, and Erik Granum. Estimation of the illuminant colour from human skin colour. In International Conference on Automatic Face & Gesture Recognition [1], pages 64–69.

[16] Jean-Christophe Terrillon, Mahdad N. Shirazi, Hideo Fukamachi, and Shigeru Akamatsu. Comparative performance of different skin chrominance models and chrominance spaces for the automatic detection of human faces in color images. In International Conference on Automatic Face & Gesture Recognition [1], pages 54–61.

[17] Jochen Triesch and Christoph von der Malsburg. Self-organized integration of adaptive visual cues for face tracking. In International Conference on Automatic Face & Gesture Recognition [1], pages 102–107.

[18] M. Yang and N. Ahuja. Detecting human faces in color images. In International Conference on Automatic Face & Gesture Recognition [1], pages 446–453.

[19] Xiaojin Zhu, Jie Yang, and Alex Waibel. Segmenting hands of arbitrary color. In International Conference on Automatic Face & Gesture Recognition [1], pages 446–453.


Appendix A

Applications description

Two applications were built: one for the evaluation of the individual face detection subtasks, and a second one for handling the masks and the relevant experiments. The applications were implemented in the Matlab environment, using m-files and mex-files written in C++. The first tool is called Face detector, the second one Selector. In this appendix, the basic information needed to run and control the applications is outlined first, followed by a coarse description of the individual functions.

A.1 Face detector

Face detector is a tool that applies a detection method to a single image and visualizes the result. All the source files can be found in the home directory ˜xnovakv1/matlab/facedetect. To run the application, start Matlab and set your working directory, or add the path to the search paths with addpath(path). Then simply type facedetect to open the application window. To set the initial image database directory to a given path, use facedetect(’<database>’). The default image database is set to our WWW testing image database.

A.1.1 Application control

There are two control panels that can be used to handle the tool. The panel on the left controls the image database, which is the list of image files in a single directory, and the histogram files of skin and non-skin colors used for detection. The text field makes it possible to insert the name of a directory and load its content, which


is a list of image names that can be browsed with the buttons Previous and Next. New histogram file names can be inserted into the bottom text fields and then loaded.

The right panel controls the detection subtasks and the threshold parameter ε. There are six detection subtasks available:

• Skin detection: low-level skin detection using the false detection error ε.

• Face features: face features detection, using the multi-color model of the face and the false negative error ε in the face/non-face pixel classification.

• Connected components: simple connected component analysis using the high probability skin pixels to accept the provisory skin regions.

• Starting point: shows the starting location for the region growing algorithm.

• First face: finds only one face in the image, if any exists.

• All faces: finds all the faces in the image.

The button Apply can be used to call the given method, or Auto apply can be switched on to call the method after any change (e.g. a change of the input image). The check-box Fast mode ensures that big images are scaled down before being sent to the face finding method. The default value is ’on’ and we recommend not changing it. The threshold ε can be changed with the slider or the text field. If any face detection or connected component method is set up, the threshold is used for estimating the high probability skin pixels and is automatically set to 60 %. For more detailed instructions see the file facedetect.html in the same directory.
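The role of the threshold ε can be illustrated by a minimal sketch of a Neyman-Pearson style skin classifier built from two histograms, as used by the low-level skin detection above. This is an assumed simplification, not the thesis code: colors are reduced to a 1-D index (the real tables are 3-D), and the classifier accepts colors in decreasing order of likelihood ratio until the allowed false-positive mass ε is spent.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Given normalized skin and non-skin color histograms, return the set of
// color indices accepted as skin under a false-positive budget eps:
// sort colors by likelihood ratio skin/non-skin and accept greedily
// while the accumulated non-skin mass stays within eps.
std::vector<int> acceptedColors(const std::vector<double>& skin,
                                const std::vector<double>& nonskin,
                                double eps) {
    std::vector<int> order(skin.size());
    std::iota(order.begin(), order.end(), 0);
    auto ratio = [&](int c) {
        return nonskin[c] > 0 ? skin[c] / nonskin[c] : 1e12;
    };
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return ratio(a) > ratio(b); });
    std::vector<int> accepted;
    double falsePositive = 0.0;   // non-skin mass accepted so far
    for (int c : order) {
        if (falsePositive + nonskin[c] > eps) break;
        falsePositive += nonskin[c];
        accepted.push_back(c);
    }
    return accepted;
}
```

Raising ε admits colors with lower likelihood ratios, moving along the ROC curve toward more detections and more false positives, which is exactly what the slider controls.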

A.1.2 Functions description

In this section, a short description of each function is given. The functions are sorted in alphabetical order and the extension .m or .cpp denotes the m-files or the C++ files respectively.

allReg.m – takes the image, the mask of provisory skin pixels and the mask of high probability skin pixels, and returns the mask marking the face regions together with the set of ellipses around the faces.

classf.cpp – takes the image and a classification table and produces a mask marking the pixels according to the table (e.g. skin).


conComp2.m – applies a simple connected component analysis using two masks: a mask marking the provisory skin pixels and a mask marking the high probability skin pixels.

createTable.m – creates the classification tables. If the histogram names are passed, the histograms are loaded from disk.

darkHoles.m – computes the number of dark holes in the image part specified by the given mask.

ellipse.m – computes the best-fit ellipse for the pixels given by a mask. Returns a polygon marking the ellipse and the lengths of the axes.

facedetect.m – initializes the tool environment and creates the tool window. Handles the callbacks received from the graphics components.

firstReg.m – searches for a face in the image. If a face is found, the function returns the mask marking the corresponding region and the best-fit ellipse.

floodSeg*.cpp – the region growing algorithm. Takes the image, two masks and the starting location, and produces a mask marking the last region and its boundary, the loss functions and the growing profit function. There are several versions of this algorithm; see facedetect.html for details.

change image.m – updates the image in the image axes to the actual image.

load db.m – loads the content of the given directory and shows the filenames of all the images in the file list.

localMax.m – finds the set of local maxima in a function.

mixtab.cpp – creates a classification table using the Bayes rule. The input is N couples (histogram, prior probability); N is at most 20.

neyman.cpp – takes two histograms of the positive and negative classes and returns the 3D table of rates for individual colors and the table of thresholds for a given false positive error. There are 100 numbers in the table, for ε ∈ {0, 1, 2, . . . , 99}.

panel.m – sets up all the graphics components and their callbacks.

readhist.cpp – reads a histogram file *.hst and returns the histogram table and the total sum of pixels in the histogram.

selvis.cpp – visualizes the face features classification result using the image and the mask with the values 1 for skin, 2 for lips and 3 for hair.

startGrow.m – finds the starting location for the region growing algorithm.

view class.m – applies the actual detection method and shows the result.

view image.m – shows the actual image in the image axes. If the image is not present in memory, it is loaded from disk.


A.2 Selector

Selector is a tool that can be used to select regions of pixels in the image and use them for subsequent functions, like histogram building, color distribution visualization, etc. All the m-files and mex-files can be found in the home directory ˜xnovakv1/matlab/selector. To run the application, start Matlab and set your working directory, or add the path to the search paths using addpath(path). Then simply type selector to open the application window. To set the initial image database directory to a given path, use selector(’<database>’). The default image database is set to our WWW testing image database.

A.2.1 Application control

There are two control panels that can be used to handle the tool. The panel on the left controls the image database, which is the list of image files in a single directory, and the saving of the histogram files of a given class. The text field makes it possible to insert the name of a directory and load its content, which is a list of image names that can be browsed with the buttons Previous and Next.

The right panel sets the selection method, which can be either a polygon selection or the flood fill selection. The flood fill selection is controlled by two parameters (the sliders on the right) that set the maximal difference between neighboring pixels and the maximal difference to the starting pixel. The selection target can be one of three classes: skin (default), lips or hair. A change of the selection target changes the histogram filename to the relevant class. The actual selection target determines the class of the subsequently selected pixels. For detailed instructions see the file selector.html in the given directory.
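The two-threshold flood fill selection described above can be sketched as follows. This is an assumed simplification of the behaviour, not the thesis' floodsel.cpp: the image is grayscale, and a breadth-first search replaces the recursive formulation to avoid deep recursion on large regions.

```cpp
#include <cstdlib>
#include <queue>
#include <utility>
#include <vector>

// Flood-fill selection: starting at (sr, sc), a 4-neighbour is added if it
// differs from the pixel it was reached from by at most tNeighbor AND from
// the starting pixel by at most tStart. Returns the selection mask.
std::vector<std::vector<bool>>
floodSelect(const std::vector<std::vector<int>>& img,
            int sr, int sc, int tNeighbor, int tStart) {
    int rows = static_cast<int>(img.size());
    int cols = static_cast<int>(img[0].size());
    std::vector<std::vector<bool>> mask(rows, std::vector<bool>(cols, false));
    std::queue<std::pair<int, int>> q;
    mask[sr][sc] = true;
    q.push({sr, sc});
    const int dr[] = {1, -1, 0, 0}, dc[] = {0, 0, 1, -1};
    while (!q.empty()) {
        auto [r, c] = q.front(); q.pop();
        for (int k = 0; k < 4; ++k) {
            int nr = r + dr[k], nc = c + dc[k];
            if (nr < 0 || nr >= rows || nc < 0 || nc >= cols || mask[nr][nc])
                continue;
            if (std::abs(img[nr][nc] - img[r][c]) <= tNeighbor &&
                std::abs(img[nr][nc] - img[sr][sc]) <= tStart) {
                mask[nr][nc] = true;
                q.push({nr, nc});
            }
        }
    }
    return mask;
}
```

The neighbor threshold lets the selection follow gradual shading across a skin patch, while the start-pixel threshold caps how far the selected colors may drift overall.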

A.2.2 Functions description

In this section, a short description of each function is given. The functions are sorted in alphabetical order and the extension .m or .cpp denotes the m-files or the C++ files respectively.

change image.m – changes the image in the image axes to the actual image.

countDisp.m – evaluates the color variance within the skin masks. Prints the mean, variance, maximal and minimal value of the color variance in the individual images.


floodSelect.m – calls floodsel.cpp with the actual image and a given starting location, and refreshes the image.

floodsel.cpp – selects the pixels using a recursive algorithm that takes the image, a starting location and two thresholds.

key handle – handles the shortcut keys.

loadMasks.m – loads all the available masks from the directory imageData/masks. The masks are stored in HDF format and have the same names as the images.

load db.m – loads the image database from the actual directory and shows the first image.

panel.m – sets up all the graphics components and its callbacks.

saveHst.m – generates the histograms from all the images and masks and saves the appropriate one in a file in the current directory.

saveMasks.m – saves all the non-empty masks into the mask directory imageData/masks. The masks are stored in HDF format and have the same names as the images.

selector.m – initializes the tool environment and creates the tool window. Handles the callbacks received from the graphics components, including the callbacks used for polygon selection.

showDist.m – shows the color distribution of the skin pixels selected in the actual image.

updthist2.cpp – adds pixels to a histogram using the image, a mask and a given class of pixels.

view image.m – shows the actual image.

writehist.cpp – saves the given histogram into a file with a given name.
