Photo Tagging by Collection-Aware People Recognition

Cristina Nader Vasconcelos, UFF

[email protected]

Vinicius Jardim, UFF

[email protected]

Asla Sá, FGV

[email protected]

Paulo Cezar Carvalho, IMPA

[email protected]

Abstract

Some mature results in person recognition research restrict their use cases to near-frontal face poses under controlled environment conditions. Such special pose and environment conditions can be accepted if imposed as a requirement for security applications, but they are not reasonable constraints when dealing with personal collections. In this paper, a person recognition approach is proposed that is designed to aid the task of tagging personal collections under non-controlled conditions. It can be generally described as a collection-aware clustering approach that evaluates people's appearance across the photo set. A classificatory procedure working over a trained Kohonen network is proposed. The network is responsible for learning people's appearance, working over a dynamically reduced feature space.

1. Introduction

The problem of face recognition has received distinguished attention from the Computer Vision, Pattern Recognition and Biometrics research communities over the past years. Such effort is justified by several application scenarios, ranging from security to the entertainment industry (and many others). Nowadays, automatic face recognition systems can be considered to have reached a certain level of maturity, as long as the face image acquisition procedure follows certain restrictions. But such restrictions imply that current systems are still far away from the capability of the human perception system. For instance, face recognition from images acquired in an outdoor environment with changes in illumination and/or pose remains a largely unsolved problem.

Inspired by the huge amount of digital content being produced, this paper deals with the problem of helping people tag personal photo collections, and suggests revisiting the capability of the human perception system in such a context. For this task, the restrictions imposed by several person recognition approaches based on facial features are not reasonable, as our visual system is able to discover who the people in a photo are even if someone is facing backwards or has his face occluded from the camera, as long as we have some familiarity with the scene. Figure 1 illustrates a photo that will defeat any tagging system looking for unique facial descriptors.

Figure 1. Hard samples for face recognition

While in our application scenario there is no need to prevent identification intruders (who maliciously try to induce a false identification by exploiting the technology's weaknesses), it may be necessary to tag people under extreme conditions that invalidate traditional facial recognition approaches.

A human being can identify a person in such extreme conditions by comparing with other photos in the same collection where a person with similar appearance characteristics, such as skin, hair and dressing, is also present. Based on this observation, this paper proposes what we named Collection-Aware People Recognition, that is, recognition by retrieving images of people with common attributes within the same collection.

Our proposal assumes that a collection covers photos from a single event (for instance, a birthday party, a family meeting, and so on), such that each person mostly maintains his or her appearance characteristics (like wearing the same clothes) in every shot of the collection. Besides that, the algorithm presented imposes neither a fixed number of people per collection nor a minimum number of photos in which a person appears.

In order to reach this goal, an adaptation of a Kohonen network [8, 4] is proposed that learns people's appearance, no matter their pose, followed by a classificatory method. The features learned are obtained as a dynamic selection of bins from normalized color histograms.

This paper is organized as follows. Section 2 presents some related works. The description of our proposal is presented in Section 3, followed by an analysis of its results in Section 4. Finally, conclusions are presented in Section 5.

2. Related Works

This section briefly describes some commercial applications and some proposals found in the literature for the problem of person tagging in photo collections, aiming to discuss existing distinct approaches.

A review of face recognition methods is left out of this article, since it is not exactly our goal to propose a new facial recognition method. Instead, a method for person identification over a photo collection is being proposed here. Furthermore, face recognition is too huge a research topic by itself to be summarized here. Interested readers can find surveys about face recognition techniques in [13, 1, 11].

As examples of commercial applications for photo collection management, it is worth mentioning iPhoto [7] and Picasa [9]. Both applications implement face recognition methods embedded within a tagging tool, with the intention of grouping similar faces. Ideally, each group represents the faces of a single person. Even though they present good results, these applications apply a conservative strategy for their group formation: faces are not considered (nor presented to the user) when uncertainty is high, so that the overall grouping precision is kept very high. Beyond face recognition, the authors of this paper are not aware of these applications analyzing any extra appearance feature from the collection, only the faces in each image.

The first proposal discussed here that works over features describing both face and clothing is the work of Chu et al. [2]. They propose clustering using the K-means algorithm. It is relevant to note that their face descriptors play a major role during the clustering process. Their clothing descriptors are used in a refinement step that removes faces whose associated clothing descriptors largely differ from the averaged cluster descriptor. Besides, the facial descriptors are obtained from a PCA-based procedure working over frontal faces only. Another limitation of this approach is that the number of clusters is specified by the user, as it is a mandatory input parameter for the K-means method.

In [5], Gallagher and Chen propose a technique for clothing segmentation using mutual information between multiple images. Initially, each face is normalized in scale and a dimensional vector is produced as its description. Then, they apply a segmentation step in which a graph cut is used to refine the clothing segmentation contour for every face considered, using a previously learned global clothing mask. Both facial and clothing features are fed into a K-nearest-neighbor classifier constructed for each identity considered in the collection. Each K-NN model is used to create a probability model for a person already known. The label suggested for a new photo is found by maximizing a bipartite graph matching (whose individual matching weights are inherited from the K-NN models), solved using the Hungarian algorithm. Their proposal also assumes that the number of individuals is known.

In [10], Lo Presti et al. propose a framework for photo tagging, also working over facial and clothing descriptions, by finding a maximum matching on a bipartite graph. As a criterion differing from the above methods, they point out that while re-identifying known persons, that is, when looking for them in new photographs, they take into account that the same person cannot appear twice in the same photo.

In contrast to our proposal, both the commercial and the literature approaches presented in this section suppose the detection of the corresponding faces prior to person identification. None of them proposes an alternative in case the adopted face descriptors are missing, implying a reduction in the range of person poses really supported. Such an assumption is too restrictive when considering the pose variations present in a personal collection. Although our approach can be integrated with a face recognition method for further refinement in cases where a proper face image is present, this paper focuses on cases where facial descriptors are absent.

3. Proposed Method

Aiming to highlight the importance of appearance features for the problem of tagging photo collections, our goal is to propose a method and evaluate it under unconstrained face and body poses.

Considering images of people, this paper initially proposes a categorization of especially hard cases for identification procedures. The four categories proposed correspond to: a) Non-frontal faces; b) Occluded faces; c) Facing-backwards poses; and d) Twins. Figure 2 illustrates our four categories.

By proposing such categories, it is observed that even though face recognition methods can be improved to deal with non-extreme cases of the first of our four categories (non-frontal faces), and may solve some cases of non-identical twins, there still remain important cases that will not be resolved by algorithms exploring facial features only, as those features may not be present in the images.

Revisiting the human being's identification ability under those four categories, it can be observed that even simple features, such as the colors of the subjects' clothing, are used by our visual system in order to re-identify the people appearing in multiple images from a photo collection.

Figure 2. Proposed categories: a) Non-frontal; b) Occluded; c) Facing backwards; d) Twins

It is worth pointing out that skin regions are usually shown in images under the categories named Non-frontal and Twins, and that human hair regions are usually present in images under the categories named Non-frontal, Facing Backwards and Twins. They can be further used by our visual system for identification, but they may not appear in the images from the Occluded category.

Such observations motivated our use of color distributions as descriptors of a person's appearance, to be used as input for the proposed algorithm. The features analyzed by our algorithm are presented in detail in Subsection 3.1.

It is assumed that no extra information is given as input to our algorithm beyond the bounding boxes, each one containing the coarse segmentation of a person within an image of the photo collection. More specifically, neither the number of different identities, nor the number of times each one appears, nor even initial samples of the queried people's appearance, are known a priori.

Once the features are defined, they are passed to a classificatory algorithm. Supervised methods from Machine Learning cannot be used, due to the unavailability of training data describing the identities of the collection, as they are not known a priori. Adopting an unsupervised approach, the problem of identifying people across a photo collection of the same event can be posed as a clustering problem. That is, it can be seen as the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar to each other than to those in other clusters. In our approach, a cluster groups subregions from different photos containing a certain person.

Cluster analysis can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find them. Some clustering algorithms have already been explored in person tagging applications, as described in Section 2. In [2], a centroid-based clustering known as K-means is adopted. It can be briefly described as an optimization problem that, given k, the number of clusters, searches for the k cluster centers and assigns each object to the nearest cluster center, such that the squared distances from the cluster centers are minimized.

In [5], the number of identities in the collection is given as input. A second parameter appears as a K-Nearest-Neighbors model is constructed for each identity. Such a clustering algorithm can be seen as an optimization problem for finding the K closest points in a metric space. In the described application scenario, the parameter K implies that a person should appear at least K times in a collection in order to ensure that its K-NN model can be correctly constructed.

While clustering techniques like K-means and K-Nearest Neighbors demand, respectively, that the number of identities in the collection and the minimum number of times they appear be given as input, this paper investigates the use of the Self-Organizing Map (SOM) as an unsupervised learning method that does not require such input parameters. The general concept and the classification procedure proposed are detailed in Subsection 3.2.

3.1. Person appearance featuring

Our algorithm adopts a color histogram as the descriptor of a person's appearance, as it is a set of colors that our visual system observes while re-identifying a person across a photo collection. Color ranges can be used to describe the person's skin and hair, and also to describe clothing, makeup and accessories, each of which can be multicolored. Color distributions, described by histograms, provide a robust, efficient cue for indexing into a large database of models [12]. Besides, they are stable representations in the presence of occlusion and over changes in view, which are natural variations expected for people over a photo collection of the same event.

3.1.1. Color Spaces: A good color space for representing a person's appearance should preserve the perceived color differences. Three color spaces were investigated: RGB, HSV and Lab.

The well-known RGB color model is far from being perceptually uniform. A descriptor constructed over the RGB color space must select the histogram quantization step sizes to be fine enough that distinct colors are not assigned to the same bin [12].

Although the RGB model can reproduce a wide variety of colors, the relationship between the constituent amounts of red, green and blue light and the resulting color is unintuitive. Because of this, color pickers based on the HSV color space (Hue, Saturation and Value/Intensity) are an attempt to accommodate more traditional and intuitive color mixing models.

The Lab color space (L for lightness, and a and b for the color-opponent dimensions) is an improved representation in the sense of perceptual uniformity. This property means that a change of the same amount in a color value should produce a change of about the same visual importance.

The person descriptor adopted was chosen by testing our algorithm, using representations in those color spaces, against different photo collections (see Section 4). Six representations based on color histograms, varying from one to three dimensions, were tested, corresponding to the following components: H (from HSV); HS (from HSV) and ab (from Lab); and RGB, HSV and Lab.
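To make the descriptor concrete, the sketch below builds a normalized 3D color histogram for one person's bounding box. The function name, the sparse-dict representation, and the channel ranges (Lab-like) are our own illustrative assumptions; the bin counts shown are hypothetical defaults, not the paper's tuned setting.

```python
# Illustrative sketch (not the paper's code): a quantized, normalized 3D
# color histogram over a list of (c1, c2, c3) pixel tuples.
def color_histogram(pixels, bins=(6, 12, 12),
                    ranges=((0, 100), (-128, 128), (-128, 128))):
    """Quantize each pixel's three channels into bins; return a sparse
    normalized histogram as a {(i, j, k): frequency} dict."""
    hist = {}
    for px in pixels:
        idx = []
        for value, n, (lo, hi) in zip(px, bins, ranges):
            # clamp to the channel range, then map to a bin index in [0, n-1]
            b = int((min(max(value, lo), hi) - lo) / (hi - lo) * n)
            idx.append(min(b, n - 1))
        key = tuple(idx)
        hist[key] = hist.get(key, 0) + 1
    total = float(len(pixels))
    return {k: v / total for k, v in hist.items()}

# toy example with three Lab-like pixels
h = color_histogram([(50.0, 0.0, 0.0), (50.0, 0.0, 0.0), (90.0, 60.0, -40.0)])
```

The sparse dict keeps only occupied bins, which fits the collection-level bin elimination described in Subsection 3.1.3.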

3.1.2. Similarity measurement: Once a descriptor is constructed, it is necessary to define a similarity criterion working over it. Many metrics have been used to define the similarity between two histograms. The Euclidean distance and its variations are the most commonly used.

Our algorithm was tested adopting the metrics below, given two histograms H and G adopting the same representation, whose elements are indexed using i, j, k (or fewer indexes in 1D and 2D histograms):

• Histogram Euclidean distance:

d(H,G) = \sqrt{\sum_{\forall i}\sum_{\forall j}\sum_{\forall k} \big(H(i,j,k) - G(i,j,k)\big)^2} \quad (1)

• Histogram intersection distance [12]:

d(H,G) = \frac{\sum_{\forall i}\sum_{\forall j}\sum_{\forall k} \min\big(H(i,j,k),\, G(i,j,k)\big)}{\min(|H|,\, |G|)} \quad (2)
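A minimal sketch of the two measurements, for sparse histograms stored as {bin: count} dicts (the storage format and function names are our own assumptions):

```python
import math

def euclidean_distance(H, G):
    """Eq. (1): square root of the summed squared per-bin differences."""
    bins = set(H) | set(G)
    return math.sqrt(sum((H.get(b, 0) - G.get(b, 0)) ** 2 for b in bins))

def intersection_distance(H, G):
    """Eq. (2): summed per-bin minima, normalized by the smaller histogram
    mass. Note this is a similarity score (1 = identical support)."""
    common = set(H) & set(G)
    num = sum(min(H[b], G[b]) for b in common)
    return num / min(sum(H.values()), sum(G.values()))

H = {(0, 0): 4, (0, 1): 2}
G = {(0, 0): 1, (1, 1): 3}
print(euclidean_distance(H, G))    # sqrt(3^2 + 2^2 + 3^2) = sqrt(22)
print(intersection_distance(H, G)) # min(4,1) / min(6,4) = 0.25
```

Iterating over the union (or intersection) of occupied bins avoids touching the many empty bins of a 3D histogram.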

3.1.3. Dimension Reduction: A disadvantage of working with descriptors based on color histograms is their high dimensionality. A large dimensionality increases the complexity and computational cost of both the distance function and, more seriously, of the learning algorithm adopted.

The drawback of a naive dimensionality reduction is that coarser sampling merges colors that may be needed for identity discrimination. In our application scenario, it is easily observed that drastically reducing the number of bins used in the uniform sampling of colors (in any of the color spaces adopted) induced the classification procedure to produce wrong classifications, since the descriptors lose their distinctive power. Thus, changing the number of histogram bins directly impacts the algorithm's performance.

In order to maintain discriminative power while reducing the dimensionality of the descriptors, our proposal includes a very simple and efficient procedure computed dynamically for each collection. Initially, every person bounding box in the collection has its color histogram computed using a number of bins adjusted to a proper discriminative power (finer level). Once those individual histograms have been computed, they are summed into a single global histogram describing the whole collection's distribution. The global histogram, most of the time, is actually a sparse multidimensional matrix, as it is quite a rare event to observe an image collection having pixels in every band of the histogram.

In a third step, the individual histograms are revisited, so that every bin whose corresponding bin in the global histogram has a zero value is eliminated from all individual histograms and thus not further considered by our algorithm. The remaining bins from the individual histograms are kept as the final descriptor.
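The three-step reduction above can be sketched as follows; the sparse {bin: count} dict representation and the function name are our own assumptions:

```python
# Sketch of the dynamic dimension reduction: sum all individual histograms
# into a global one, then keep only the bins the collection actually uses.
def reduce_descriptors(histograms):
    """histograms: list of sparse {bin: count} dicts, one per bounding box.
    Returns dense vectors over the collection's occupied bins, plus the
    ordered list of kept bins."""
    global_hist = {}
    for h in histograms:
        for b, v in h.items():
            global_hist[b] = global_hist.get(b, 0) + v
    # bins with a non-zero global count, in a fixed order, become the axes
    kept = sorted(b for b, v in global_hist.items() if v > 0)
    # each descriptor becomes a dense vector over the kept bins only
    return [[h.get(b, 0) for b in kept] for h in histograms], kept

vectors, kept = reduce_descriptors([{(0, 0): 5}, {(0, 0): 1, (2, 3): 2}])
```

The resulting dense vectors have one dimension per occupied bin, so the learning algorithm never iterates over bins the collection does not use.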

3.2. Clustering

A Self-Organizing Map (SOM), also known as a Kohonen network [8, 4], is a method originally developed for the visualization and analysis of high-dimensional data. It defines an ordered mapping as a topological projection from a set of given data items onto a regular grid. Each grid node is associated with a learned model m_i.

In our application, each input data item contains the values of the reduced color histogram describing a person in a specific image. The Kohonen network maps an input item onto the grid node whose model is most similar to its data. The similarity measurement can be adapted according to the application's purposes; thus, in our application, both metrics presented in Subsection 3.1.2 were tested.

The learning algorithm used by Kohonen networks can be briefly described as an iterative procedure that repeatedly presents the input data to the network. Considering n-dimensional input vectors, corresponding to the descriptors of the data items to be learned, it is assumed that the models to be learned are vectors of the same dimension n. Suppose a grid containing k topographically arranged nodes, each one representing a distinct model m_i (i ∈ {1, 2, . . ., k}).

The learning happens as a smoothing-type process in which, at each iteration step t, an input item x is randomly or sequentially selected and presented to the network. Next, x is compared against each of the grid nodes m_i, and the most similar node is chosen as the iteration winner. The learning happens as a new value for the winner node is computed by weighing its current vector with the current data item's vector. Nodes in neighboring positions are also updated, but with decreasing weights.

Given h_{ci}, a smoothing kernel associated with the grid's neighborhood function, and α(t), a learning rate that decreases with the iteration index t, the models are updated as:

m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}\, [x(t) - m_i(t)] \quad (3)

Once the network is trained, each of its nodes contains a model, such that the corresponding model vectors are more similar at nearby nodes than between nodes located farther away from each other on the grid.
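The training loop with the update rule of Eq. (3) can be sketched as below. For brevity this uses a 1D grid (the paper's network is a 4 × 4 × 4 3D grid), Euclidean similarity, a Gaussian neighborhood kernel, and a linearly decaying learning rate and neighborhood width; those schedule choices are our own assumptions, not the paper's.

```python
import math
import random

def train_som(data, k=8, iterations=1000, seed=0):
    """Train a 1D SOM of k nodes on a list of equal-length vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    models = [[rng.random() for _ in range(dim)] for _ in range(k)]
    for t in range(iterations):
        x = data[t % len(data)]  # sequential presentation of the inputs
        # winner: the node whose model is closest (squared Euclidean) to x
        c = min(range(k),
                key=lambda i: sum((x[d] - models[i][d]) ** 2 for d in range(dim)))
        alpha = 0.5 * (1.0 - t / iterations)                  # decaying rate
        sigma = max(0.5, (k / 2.0) * (1.0 - t / iterations))  # shrinking width
        for i in range(k):
            # Gaussian neighborhood kernel h_ci over grid distance |i - c|
            h = math.exp(-((i - c) ** 2) / (2.0 * sigma ** 2))
            for d in range(dim):
                # Eq. (3): m_i(t+1) = m_i(t) + alpha(t) * h_ci * (x - m_i(t))
                models[i][d] += alpha * h * (x[d] - models[i][d])
    return models
```

After training on two well-separated inputs, distinct end nodes of the grid specialize toward each of them, which is what the cluster-center search below relies on.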

In our application, after the network has been trained, it is still necessary to select which of the k trained models are going to be considered cluster centers. For that goal, our approach includes a counter in each node of the grid to track its winning frequency.

The search for cluster centers starts with a conservative strategy. In a first search, only nodes with a high winning frequency are approved. The threshold used is computed dynamically, based on the ratio of the number of iterations to the number of input samples. Our motivation is that a node that was repeatedly considered the winner for several input data items, several times during the training iterations, has its model well adapted to all of them and, at the same time, this points out the similarity between those corresponding input items.

After such an initial conservative selection, samples corresponding to smaller clusters might not be represented. In a second search, the grid models are reevaluated against the initial selection of cluster centers, so that models differing by more than a threshold from all of those already included are reconsidered as cluster centers. This search is also performed over nodes with higher frequency, in descending order. Such an ordered search is crucial, as neighboring nodes tend to present similar models.

Finally, in a third automatic search for cluster centers, the samples are compared against the selected centers. If none of them presents a model for a given sample under the similarity measurement, the grid is revisited again, looking for the best model to represent that input sample. This third step was motivated by people appearing in very few photos, who clearly should not be represented by the clusters included in the previous two steps, as the sample's characteristics are actually very different from those clusters' models.
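The first, conservative search can be sketched as follows. The function name, the counter representation, and the `factor` scaling of the iterations-per-sample ratio are our own assumptions about how the dynamic threshold might be instantiated:

```python
# Hedged sketch of the conservative first search for cluster centers: keep
# only nodes whose winning frequency exceeds a threshold derived from the
# iterations-per-sample ratio.
def frequent_centers(win_counts, n_iterations, n_samples, factor=0.5):
    """win_counts: {node_id: winning frequency} accumulated during training.
    A node is approved if it won at least `factor` times the average number
    of presentations per sample (the exact factor is an assumption)."""
    threshold = factor * n_iterations / n_samples
    return [node for node, count in win_counts.items() if count >= threshold]

# toy example: two dominant nodes and one rarely winning node
centers = frequent_centers({0: 40, 1: 2, 2: 38}, n_iterations=100, n_samples=4)
```

The second and third searches would then relax this selection, adding distant models and per-sample best matches as described above.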

4. Results

While there are several image datasets used as benchmarks for algorithms dealing with face detection and recognition, and the same can be found for pedestrian detection (such as, respectively, [6] and [3]), the authors of this paper are not aware of a public benchmark containing photo collections covering events, such that our Collection-Aware People Recognition algorithm could be numerically evaluated under public tests.

For that reason, in order to illustrate our approach over natural images covering our four hard-case categories, we constructed our own collections. The results shown in this section correspond to four collections, covering respectively: a sequence of twins playing around, containing eleven shots and six different identities (Figure 3a); a birthday party, containing twelve shots and seven different identities (Figure 3b); a family meeting, containing six shots and eight different identities (Figure 3c); and a soccer sequence, containing forty-three shots and eleven different identities (Figure 3d). The last one presents all four of the proposed categories of hard cases for facial recognition.


Figure 3. Photo Collections

Several tests of our algorithm against those photo collections were evaluated in order to select the color histogram representation to be adopted. The best clustering results were obtained when adopting a three-dimensional color histogram in the Lab color space. The results remained quite stable using 12 to 16 bins for the chromatic components, combined with only 6 bins for the L component discretization (more levels induced clustering breaks). Tests over the other five histogram representations proposed in Subsection 3.1.1 resulted in a difficult balance when choosing the number of bins so as to offer enough identity discrimination while at the same time avoiding too many clustering breaks. That is, increasing the number of bins in those color spaces produced clusters with fewer elements than expected.

As another testing result, the topology of our network was organized as a 3D grid with 4 × 4 × 4 nodes (1D and 2D grids were also tested). Even though smaller networks could support the tests over collections with a small number of identities, reducing the network's number of nodes also reduces the maximum number of identities supported. Besides, the classificatory procedure proposed is adaptive enough to discover fewer clusters within a larger grid. Still, in order to deal with collections having more identities than the number of nodes adopted (64), a larger grid should be used. Thus, even though the exact number of identities is not demanded a priori by our algorithm, a coarse approximation of its maximum may be used in order to make the algorithm free of such a limit on the maximum number of identities learned.

Regarding the similarity metric adopted, the results showed that the histogram intersection metric is very sensitive to the bounding box definition. More specifically, when adopting this metric, background information induced wrong classifications of different people into the same cluster in cases where their corresponding bounding boxes had enough background information in common with each other. This problem was overcome in our tests with this metric by delimiting each person's bounding box so that it was confined to pixels containing the person's appearance only. In order to support a coarse bounding box delimitation, the Euclidean distance was adopted as the default metric and was used in the results presented.

Figure 4. Clustering results

Figure 5. Clustering errors

Figure 4 shows the biggest clusters found in each of the collections, as there is no space for showing all of them here. It is also important to note that the algorithm failed in the clusters presented in Figure 5, caused mainly by two factors: people wearing similar clothes (Figures 5a, 5b and 5d); and body occlusion (Figure 5c), such that skin and hair areas dominate the appearance vectors but are not distinctive enough by themselves.

5. Conclusions

Even though we believe that the appearance characteristics explored in this paper do not substitute for facial descriptors, our goal in presenting an approach exclusively based on such appearance features is to show that they should not be overlooked. As our first contribution, four categories were defined pointing out hard, and even impossible, cases for re-identification if processed observing only facial features, that is, by face recognition procedures exclusively.

As another contribution, color histograms are explored in this paper as robust and efficient descriptors of a person's appearance (such as skin, hair, clothing, makeup and accessories). They are stable representations in the presence of occlusion, and are robust to changes in the object's orientation and in the camera's viewpoint, which are natural variations expected for people appearing in a photo collection. A fast and effective descriptor reduction procedure, computed dynamically for each photo collection, was also presented.

Last, this paper presented a clustering algorithm based on a Kohonen map, such that our proposal makes no initial assumption about the exact number of individuals, nor about how many times each one appears.

References

[1] A. F. Abate, M. Nappi, D. Riccio, and G. Sabatino. 2D and 3D face recognition: A survey. Pattern Recognition Letters, 28(14):1885–1906, 2007.

[2] W.-T. Chu, Y.-L. Lee, and J.-Y. Yu. Using context information and local feature points in face clustering for consumer photos. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1141–1144, 2009.

[3] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition, pages 304–311, 2009.

[4] L. Fausett, editor. Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1994.

[5] A. Gallagher and T. Chen. Clothing cosegmentation for recognizing people. In Proc. CVPR, 2008.

[6] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

[7] iPhoto. http://www.apple.com/ilife/iphoto/.

[8] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69, 1982.

[9] Picasa. http://picasa.google.com/.

[10] L. L. Presti, M. Morana, and M. L. Cascia. A data association algorithm for people re-identification in photo sequences. In ISM, pages 318–323. IEEE Computer Society, 2010.

[11] P. Sinha. Face recognition by humans: Nineteen results all computer vision researchers should know about. Proceedings of the IEEE, pages 1948–1962, 2006.

[12] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7:11–32, 1991.

[13] C. Zhang and Z. Zhang. A survey of recent advances in face detection, 2010.