
2011 18th IEEE International Conference on Image Processing (ICIP 2011), Brussels, Belgium, 2011.09.11-2011.09.14

FAST FACE SEQUENCE MATCHING IN LARGE-SCALE VIDEO DATABASES

Hung Thanh Vu1 Thanh Duc Ngo2 Thao Ngoc Nguyen1 Duy-Dinh Le3

Shin’ichi Satoh3 Bac Hoai Le1 Duc Anh Duong1

1University of Sciences, 227 Nguyen Van Cu, Ho Chi Minh City, Vietnam
2The Graduate University for Advanced Studies (Sokendai), Japan

3National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan

ABSTRACT

There have recently been many methods proposed for matching face sequences in the field of face retrieval. However, most of them have proven to be inefficient in large-scale video databases because they frequently require a huge computational cost to obtain a high degree of accuracy. We present an efficient matching method that is based on face sequences (called face tracks) in large-scale video databases. The key idea is to capture the distribution of a face track in the fewest number of low-cost computational steps. To do that, each face track is represented by a vector that approximates the first principal component of the face track distribution, and the similarity of face tracks is based on the similarity of these vectors. Our experimental results on a large-scale database of 457,320 human faces extracted from 370 hours of TRECVID videos from 2004-2006 show that the proposed method easily handles the scalability by maintaining a good balance between speed and accuracy.

Index Terms— Face retrieval, face track matching, subspace method.

1. INTRODUCTION

A huge number of videos generated daily come from sources such as television programs, broadcast news, surveillance videos, and movies. The principal objects in these videos are humans. This implies that the human face is the most important object in video retrieval. Currently, frontal faces can be efficiently extracted from these video sources using face detectors [1]. These face detectors can produce large databases of up to tens of millions of faces. The problem is how to organize these databases for efficient and accurate retrieval. Solving this problem would be beneficial for a wide range of applications, from video indexing and event detection to person search in videos.

The conventional approach is to use a single image for matching. In this approach, each single image can be represented as a point in a high-dimensional space. The similarity between the query image of a person and an image in the database is the exact distance between two points in the feature space. The main drawback of this approach is that the matching results rely completely on these points, which are unstable due to the huge number of variations in the human face, such as head poses, facial expressions, and illumination. Another approach deals with the dependence on such unstable points by using a face sequence instead of a single image; thus, a person is represented by the distribution of a point set in the feature space. Methods following the face sequence based approach usually try to model this distribution. Shakhnarovich et al. [2] modeled a face sequence using a probability distribution. Cevikalp and Triggs [3] treated a face sequence as a set of points and found the convex geometric region spanned by these points. The min-min method [4, 5, 6] considered a face sequence as a cluster of points and measured the distance between these clusters. Subspace methods [7, 8, 9] viewed a face sequence as points spread over a subspace. Although these methods can be highly accurate, a lot of computation is needed to represent the distribution of the face sequence, such as computing the convex hulls in [3], the probability models in [7], and the eigenvectors in [7, 8, 9]. For this reason, they are not scalable to large video databases. Other methods that can efficiently match numerous face tracks, such as the k-Faces method [10], usually sacrifice accuracy for speed. Therefore, demand is growing for algorithms that balance both speed and accuracy at large scale.

We propose an efficient method for matching face tracks in large-scale video databases. We follow the idea described for the subspace methods [7, 8, 9], in which the similarity between two face tracks is estimated by the similarity between two distributions. However, scalability is as important as accuracy in such databases. Therefore, unlike the subspace methods, we approximate the first principal component, which corresponds to the largest variation, by a single vector instead of finding subspaces, which requires a huge computational cost. In this way, the computational cost is significantly reduced while the accuracy is maintained and is comparable to other methods.

The rest of our paper is organized as follows. Section 2 introduces an overview of our framework: Subsections 2.1 and 2.2 present our previous work in [10] for face track extraction;

2011 18th IEEE International Conference on Image Processing

978-1-4577-1303-3/11/$26.00 ©2011 IEEE 2549


Fig. 1. Eigenvectors (with largest eigenvalues) and mean vectors of two 3D face tracks.

Fig. 2. Computing the cosine distance before (a) and after (b) the zero-mean normalization step.

our proposed method is described in Subsection 2.3. Finally, our experiments and conclusion are presented in Sections 3 and 4.

2. FRAMEWORK OVERVIEW

2.1. Face sequence extraction

There are several approaches for extracting face tracks from videos. Sivic et al. [6] use a face detector to locate human faces in every frame. Faces of the same person are associated together by tracking covariant affine regions over time. Their method yields high-quality results but is too complex due to the huge cost of running affine covariant detectors and trackers. Another efficient extraction method, used in [4], is the Kanade-Lucas-Tomasi (KLT) tracker. KLT is applied to every frame to track interest points within a shot. A pair of face regions in different frames is linked by comparing the number of tracked points that pass through both face regions to the total number of points in the two regions. However, the tracked points are usually sensitive to illumination changes, occlusions, and false face detections, and thus many fragmented face sequences may be created. We also apply a KLT tracker to associate faces of the same person but, unlike [4], we maintain the interest points within the face region instead of the whole shot and re-compute the tracked points in every frame. Our method was shown to be more efficient and robust than Everingham et al.'s method in [11].
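As a sketch, the linking rule used in [4] can be written as an overlap score between the sets of KLT tracks that pass through two face regions. The helper names and the threshold value below are ours, for illustration only, not from the paper:

```python
def link_score(track_ids_a, track_ids_b):
    """Fraction of KLT point tracks shared by two face regions.

    track_ids_a / track_ids_b: sets of IDs of the interest points that
    fall inside each detected face region (in different frames).
    """
    shared = len(track_ids_a & track_ids_b)
    total = len(track_ids_a | track_ids_b)
    return shared / total if total else 0.0


# Two detections are linked into the same face track when the overlap
# score exceeds a threshold (the value here is illustrative).
LINK_THRESHOLD = 0.5

def same_person(track_ids_a, track_ids_b):
    return link_score(track_ids_a, track_ids_b) >= LINK_THRESHOLD
```

Under this rule, fragmented tracks arise whenever occlusion or illumination changes kill enough point tracks to push the score below the threshold.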

2.2. Facial feature representation

We use the Local Binary Pattern (LBP) [12] feature to represent the extracted faces. LBP has recently become one of the most popular features in face representation. Its remarkable advantages are that it is invariant to monotonic changes in illumination and can be computed quickly. A direct extension of the LBP proposed in [12] is LBP_{P,R}, which applies the LBP operator at different scales and rotations. The LBP_{P,R} operator at a point (x_c, y_c) in an image, where P is the number of sampling points on a circle of radius R, compares the intensity of each sampling point with the intensity of (x_c, y_c) to produce a string of binary code (0 or 1). Each string is converted into an integer that falls into a unique bin of a k-bin histogram. A face image is partitioned into a regular grid of n × n cells, and a k-bin histogram is built individually for every cell in the grid. These histograms are concatenated to create the LBP feature of the whole image. After this step, each face image is represented by a feature of D = n × n × k dimensions.
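The steps above can be sketched in NumPy for the basic P = 8, R = 1 operator with 256-bin cell histograms. This is a simplified sketch: the paper's experiments use 59-bin histograms, which corresponds to the uniform-pattern variant of LBP, whereas this version keeps all 256 codes:

```python
import numpy as np

def lbp_8_1(img):
    """Basic LBP codes (P=8, R=1) for the interior pixels of a grayscale image."""
    c = img[1:-1, 1:-1]
    # The 8 neighbors of each interior pixel, as shifted views of the image.
    neighbors = [img[0:-2, 0:-2], img[0:-2, 1:-1], img[0:-2, 2:], img[1:-1, 2:],
                 img[2:, 2:], img[2:, 1:-1], img[2:, 0:-2], img[1:-1, 0:-2]]
    codes = np.zeros(c.shape, dtype=np.uint8)
    for bit, n in enumerate(neighbors):
        # Set bit `bit` where the neighbor is at least as bright as the center.
        codes |= ((n >= c).astype(np.uint8) << np.uint8(bit))
    return codes

def lbp_feature(img, grid=3, bins=256):
    """Concatenated per-cell LBP histograms over a grid x grid partition,
    giving a D = grid * grid * bins dimensional feature."""
    codes = lbp_8_1(img)
    h, w = codes.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = codes[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid]
            hist, _ = np.histogram(cell, bins=bins, range=(0, bins))
            feats.append(hist)
    return np.concatenate(feats).astype(float)
```

With the paper's settings (3 × 3 grid, 59 uniform bins), the same scheme yields D = 3 × 3 × 59 = 531 dimensions.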

2.3. Face track representation and matching

Since each face is represented as a feature point, each face track describes the distribution of the faces of one person in the feature space. When the number of faces in each face track is huge, an efficient representation of the face track distribution is needed. We use mean vectors to represent face tracks to overcome this issue. Given a face track F_i of n_i faces, the mean vector is given by:

    v_i = (1 / n_i) * Σ_{j=1}^{n_i} f_ij,

where f_ij is the j-th face of the face track F_i. The motivation for using the mean vector is that it closely approximates the first principal component, which corresponds to the direction of maximum variance of the data. In other words, the mean vector can replace the first eigenvector (corresponding to the largest eigenvalue) of the subspace used in subspace methods. The example in Figure 1 shows two face tracks in a three-dimensional space. The green vectors are the mean vectors of the face tracks, while the red (blue) one is the eigenvector (corresponding to the largest eigenvalue) representing the subspace of the red (blue) face track. It is easy to see that the


mean vectors and the eigenvectors are nearly identical. Based on this approximation, we believe that our method can inherit the advantages of the subspace methods in terms of accuracy. In addition, our method only requires O(n_i × D) operations to find a mean vector, and thus the face track representation can be computed quickly.
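The closeness of the mean vector to the first principal direction can be checked numerically. The sketch below compares a track's mean with the first singular vector of the uncentered data matrix (the kind of basis subspace methods build from raw feature vectors). Non-negative features such as LBP histograms cluster away from the origin, which is when the approximation holds; the synthetic data here is purely illustrative:

```python
import numpy as np

def mean_vs_first_pc(track):
    """Cosine similarity between a track's mean vector and the first
    singular vector of the raw (uncentered) data matrix."""
    mean_vec = track.mean(axis=0)
    _, _, vt = np.linalg.svd(track, full_matrices=False)
    first_pc = vt[0]  # unit-norm row of V^T
    return abs(mean_vec @ first_pc) / np.linalg.norm(mean_vec)

# Synthetic "face track": points clustered away from the origin,
# loosely mimicking non-negative histogram features.
rng = np.random.default_rng(1)
track = rng.normal(loc=5.0, scale=1.0, size=(300, 3))
print(mean_vs_first_pc(track))  # close to 1
```

The O(n_i × D) mean is thus a cheap stand-in for the O(D^2 (D + n_i)) eigen-decomposition when the data is concentrated in one dominant direction.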

After the extraction and representation processes are complete, the face tracks are organized into databases for the matching phase. Given an input face track, the similarity between the input face track and each face track in the databases is estimated, and a ranked list is returned according to the similarity scores. There are some common similarity measures for matching in face retrieval, such as the Euclidean, L1, and HIK distances. However, we choose the cosine distance to measure the similarity between two mean vectors. This idea is based on the angle distance between two subspaces, which is successfully used in subspace methods. We refer to our method as mean-cos (using the mean vector for representation and the cosine distance for matching), and its details are described as follows:

mean-cos

1. Compute the mean of all faces in the database:

    u = (1 / N_f) * Σ_{i=1}^{N} Σ_{j=1}^{n_i} f_ij,

where N is the number of face tracks and N_f = Σ_{i=1}^{N} n_i is the total number of faces in the database.

2. Find the mean vector v_i for each face track F_i.

3. Normalize the mean vector: v_i = v_i − u.

4. Compute the cosine distance from the query face track G to each face track F_i.

5. Return a ranked list.

Steps 1 and 3 compose a zero-mean normalization step. The purpose of normalizing the data to zero mean is to enhance the discrimination of the cosine distance. Figure 2a shows an example where this measure makes a mistake: face track F3 is considered to be further from F2 than F1 is, since the angle ϕ between v3 and v2 is greater than the angle θ between v1 and v2. Meanwhile, the distances between the face tracks are correctly estimated (ϕ < θ) after the zero-mean normalization in Figure 2b.
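The five steps can be sketched with NumPy as follows. The function and variable names are ours; this is an illustration of the pipeline, not the authors' code:

```python
import numpy as np

def mean_cos_rank(tracks, query):
    """Sketch of the mean-cos matching pipeline.

    tracks: list of (n_i, D) arrays, one per database face track.
    query:  (n_q, D) array, the query face track.
    Returns database indices ranked by cosine similarity to the query.
    """
    # Step 1: global mean over every face in the database.
    u = np.concatenate(tracks).mean(axis=0)
    # Steps 2-3: per-track mean vectors, zero-mean normalized.
    V = np.stack([t.mean(axis=0) for t in tracks]) - u
    q = query.mean(axis=0) - u
    # Step 4: cosine similarity between the query and every track
    # (epsilon guards against zero-norm vectors).
    sims = (V @ q) / (np.linalg.norm(V, axis=1) * np.linalg.norm(q) + 1e-12)
    # Step 5: ranked list, most similar first.
    return np.argsort(-sims)
```

Representation costs one pass over the faces, and matching reduces to a single matrix-vector product per query, which is where the speedup over eigen-decomposition-based subspace methods comes from.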

3. EXPERIMENTS

3.1. Database and evaluation

We used the database described in [10] to evaluate our method. This database was collected from 370 hours of TRECVID news video from 2004-2006. According to [10], the faces were extracted, annotated, and organized into a database of 1,510 face tracks from 49 people, containing 457,320 face images. The LBP feature was extracted for each image using a 3×3 grid and 59-bin LBP histograms to create a 531-dimensional feature.

Method      MAP (%)
CMSM        58.39
mean-cos    58.13
MSM         57.72
min-min     56.93
k-Faces     54.97

Table 1. MAP results from TRECVID data.

For evaluation, we used the mean average precision (MAP), a common measure for evaluating information retrieval systems in general and face retrieval systems in particular. MAP is also a standard benchmark in many reliable competitions, such as the TRECVID workshop [13] and the PASCAL VOC challenge [14]. We used each face track in the database as a query, giving us 1,510 queries. The MAP value was evaluated from the results of the 1,510 queries and used to compare the methods.
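For reference, average precision and MAP over ranked retrieval results can be computed as below. This is the standard formulation with binary relevance flags, not code from the paper:

```python
def average_precision(ranked_relevant):
    """AP for one query.

    ranked_relevant: list of 0/1 flags, one per returned item in rank
    order (1 = the returned face track shows the same person as the query).
    """
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(all_queries):
    """MAP: the mean of the per-query APs (here, over the 1,510 queries)."""
    return sum(average_precision(q) for q in all_queries) / len(all_queries)
```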

3.2. Results and analysis

We compared our method (mean-cos) with MSM, CMSM, min-min, and k-Faces. All our experiments were performed on a Linux server with 24 cores at 2.66 GHz and 128 GB of RAM. Table 1 lists the MAP values of all the methods. The mean-cos method achieved a MAP of 58.13%, outperforming k-Faces (54.97%), min-min (56.93%), and MSM (57.72%), and is comparable to CMSM (58.39%). These results show that the mean vectors describe the first principal component of the distribution of face tracks very well.

Suppose N is the number of face tracks in the database; M is the average number of faces per face track; D is the number of feature dimensions; and Dc is the constrained subspace dimension in the CMSM method. In our experiments, N is 1,510; M is 302; D is 531; and Dc is 500 (Dc is chosen so that CMSM yields the highest MAP).

The performance times and computational complexities of all the methods are listed in Table 2. The min-min method has no representation phase, but its matching time is huge, with a complexity of O(N × N × D × M × M) against O(N × N × D) for the other methods. In the face track representation phase, the subspace methods (MSM and CMSM) need O(N × D × D × (D + M)) while our method requires O(N × M × D); that is, our method is roughly D times faster than the subspace methods. The performance time results show that mean-cos is 438 and 559 times faster than MSM and CMSM, respectively: MSM and CMSM need more than 6 minutes to represent 1,510 face tracks while


Method     Representation time (s)   Representation Big O       Matching time (s)   Matching Big O
min-min    0                         None                       6,543,790           O(N × N × D × M × M)
MSM        396                       O(N × D × D × (D + M))     449                 O(N × N × D)
CMSM       505                       O(N × D × D × (D + M))     407                 O(N × N × Dc)
mean-cos   0.9                       O(N × D × M)               184                 O(N × N × D)
k-Faces    0.28                      O(N × D)                   120                 O(N × N × D)

Table 2. Performance time and computational complexity of all methods.

our method only requires 1 second. In the matching phase, all the methods (except min-min) have a very similar computational complexity of O(N × N × D) (CMSM actually has O(N × N × Dc), but Dc < D). However, mean-cos is 2 times faster than the subspace methods and 35,000 times faster than min-min, since these methods need more time for complex operations, such as the matrix operations and eigenvector decompositions in the subspace methods or the distance calculations over all point pairs of two face tracks, which make them totally unsuitable at scale. The k-Faces method is the fastest of these methods (3.28 times faster than our method in the representation phase and 1.53 times in the matching phase), but it is less accurate than the other methods (Table 1). The experimental results show that our proposed method satisfies the trade-off between the two key problems of scalability: accuracy and speed.

4. CONCLUSION

We introduced an efficient and accurate method for matching face tracks in large-scale databases. The face tracks extracted from video sequences are represented by the mean vector of each face track. After a normalization step, the cosine distance is used to measure the similarity between two face track distributions. The efficiency of our method was demonstrated in both theory and practice on a large-scale face track database extracted from TRECVID videos, while its accuracy is equivalent to that of state-of-the-art methods.

5. REFERENCES

[1] Paul A. Viola and Michael J. Jones, “Rapid object detection using a boosted cascade of simple features,” CVPR, 2001.

[2] Gregory Shakhnarovich, John W. Fisher, III, and Trevor Darrell, “Face recognition from long-term observations,” ECCV, 2002.

[3] Hakan Cevikalp and Bill Triggs, “Face recognition based on image sets,” CVPR, 2010.

[4] M. Everingham, J. Sivic, and A. Zisserman, ““Hello! My name is... Buffy” – automatic naming of characters in TV video,” BMVC, 2006.

[5] A. Hadid and M. Pietikainen, “From still image to video-based face recognition: An experimental analysis,” FG, 2004.

[6] J. Sivic, M. Everingham, and A. Zisserman, “Person spotting: Video shot retrieval for face sets,” CIVR, 2005.

[7] Wei Fan and Dit-Yan Yeung, “Locally linear models on face appearance manifolds with application to dual-subspace based classification,” CVPR, 2006.

[8] O. Yamaguchi, K. Fukui, and K. Maeda, “Face recognition using temporal image sequence,” FG, 1998.

[9] Kazuhiro Fukui and Osamu Yamaguchi, “Face recognition using multi-viewpoint patterns for robot vision,” ISRR, 2003.

[10] Thao Ngoc Nguyen, Thanh Duc Ngo, Duy-Dinh Le, Shin’ichi Satoh, Bac Hoai Le, and Duc Anh Duong, “An efficient method for face retrieval from large video datasets,” CIVR, 2010.

[11] Thanh Duc Ngo, Duy-Dinh Le, Shin’ichi Satoh, and Duc Anh Duong, “Robust face track finding in video using tracked points,” Proc. Intl. Conf. on Signal-Image Technology and Internet-Based Systems, pages 59-64, 2008.

[12] Timo Ojala, Matti Pietikainen, and Topi Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” PAMI, 2002.

[13] Alan F. Smeaton, Paul Over, and Wessel Kraaij, “Evaluation campaigns and TRECVID,” MIR, 2006.

[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes (VOC) challenge,” IJCV, 2010.
