
Person Search in a Scene by Jointly Modeling People Commonness and Person Uniqueness

Yuanlu Xu
Sun Yat-Sen University
[email protected]

Bingpeng Ma
Univ. of Chinese Academy of Sciences
[email protected]

Rui Huang∗
Huazhong Univ. of Sci. & Tech.
[email protected]

Liang Lin
Sun Yat-Sen University
[email protected]

ABSTRACT
This paper presents a novel framework for a multimedia search task: searching for a person in a scene using human body appearance. Existing works mostly focus on two independent problems related to this task, i.e., people detection and person re-identification. However, a sequential combination of these two components does not solve the person search problem seamlessly, for two reasons: 1) the errors in people detection are unavoidably carried into person re-identification; 2) the setting of person re-identification differs from that of person search, which is essentially a verification problem. To bridge this gap, we propose a unified framework which jointly models the commonness of people (for detection) and the uniqueness of a person (for identification). We demonstrate superior performance of our approach on public benchmarks compared with the sequential combination of state-of-the-art detection and identification algorithms.

Categories and Subject Descriptors
I.4 [Image Processing and Computer Vision]: Miscellaneous

General Terms
Theory, Algorithm, Experimentation

Keywords
person search, generative model, GMM, Fisher vector

1. INTRODUCTION
Over the past few decades, mature and low-cost vision acquisition equipment has substantially boosted academic studies and industrial applications in vision-related multimedia.

∗Corresponding author. This work was supported in part by NSFC 61105014, 61332016, Hi-Tech R&D Program of China (no. 2013AA013801), Special Project on Integration of Industry, Education and Research of Guangdong (no. 2012B091100148), and Fundamental Research Funds for the Central Universities (no. 13lgjc26).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ACM MM'14, November 03–07, 2014, Orlando, FL, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3063-3/14/11 $15.00.
http://dx.doi.org/10.1145/2647868.2654965.

Figure 1: An illustration of searching a person in a scene. (a) Query images (S01004, S01005, S01007) and template. (b) The results of our search approach. Blue rectangles are the groundtruth and green ones the correctly identified persons. (c) The people detection results from the HOG-based detector. (d) The person re-identification fails to identify the correct person based on the detection results.

One of the fastest growing areas is intelligent surveillance. From the perspective of public security, searching for suspects and missing people in surveillance footage is an effective procedure to help solve cases. We hereby investigate this particular multimedia search problem, i.e., person search, which aims at locating a specific person in a scene, given a query image, using visual clues.

The most-researched clue for this task is the face. Both face detection and recognition have achieved impressive performance through decades of research. However, due to the low resolution and pose variation of individuals in typical surveillance footage, faces are often deemed of limited usefulness. Going beyond faces, human body appearance is another important clue for searching for a person in a scene. Nevertheless, there is a gap between the current research trends and the actual task we are facing.

Person search can be viewed as a special case of image search, or content-based image retrieval [9, 6], which has long been studied in the multimedia field. Traditional image search does not make use of prior knowledge about the object category that the queried object belongs to, while in person search we know that the query image contains a person. The benefit is that we can use this knowledge to facilitate the search procedure, e.g., by incorporating the common characteristics of the human body obtained from a training set besides the query image itself. Another difference is that person search is instance-level search, which calls for a higher degree of accuracy for practical use than general image search.

Besides image search, extensive studies have been conducted on two other problems related to person search, i.e., people detection and person re-identification. People detection tries to find all the people in a scene, which is different from our mission of seeking a specific person given a query image. One of the most influential people detection methods employs Histograms of Oriented Gradients (HOG) features and a linear SVM classifier [1]. State-of-the-art general object detection methods such as Deformable Part Models (DPM) [?] can also be used for people detection. We refer the readers to a detailed review [2]. From the object detection point of view, person search is an extreme case of fine-grained object detection, i.e., instance-level detection.

To achieve the goal of person search, a further identification step is required to follow people detection, that is, a model to match images of people using whole-body appearance, which is usually called person re-identification. Previous works on person re-identification can be roughly divided into two categories. Learning-based methods tend to learn discriminative features [5], distance metrics [13] or transformations [7] across views and cameras of the same person. Despite acknowledged success, learning-based methods suffer from model over-fitting and are currently unsuitable for large-scale data. Matching-based approaches, on the other hand, aim at localizing and comparing the corresponding image patches [8, 12], image segments [3] or body parts [11] in images.

However, there are two drawbacks in solving the person search task by sequentially combining people detection and person re-identification. First, person re-identification focuses on measuring the similarity of two windows (manually cropped or obtained by detection) containing roughly aligned human subjects. Thus, the overall performance of identifying a specific person in a scene heavily relies on the performance of the people detector. Although good progress has been made on people detection, the results are not as promising when there are occluded people, as illustrated in Fig. 1. Obviously, a false positive or false negative reported by the people detector will unavoidably generate a corresponding false positive or false negative in the subsequent person re-identification results.

Second, the setting of person re-identification is different from that of person search. In current person re-identification research, for each window to be identified, a gallery set is commonly given so that the identity of the window can be classified or ranked against the gallery. In the person search task, however, for each candidate window there is only one query image to be verified against. Therefore, many current person re-identification methods [5, 13] do not work in this setting.

These observations inspire us to treat the person search problem in a unified way rather than solving it in a sequential fashion. We hereby propose a framework to jointly optimize the detection of people and the identification of the queried person. The philosophy behind our proposal is that people detection algorithms use the most common features (commonness) shared by all human bodies to distinguish them from other objects, while person re-identification algorithms need the most distinctive features (uniqueness) that are unique to a certain person to discriminate him/her from other people. Recent studies have shown that even small salient regions in appearance play an important role in identification [12]. However, in the sequential framework, the salient regions may easily be ignored by the people detector, especially if they belong to an occluded person, which results in false negatives in detection and consequently in the final search results. On the other hand, if we only focus on the uniqueness of the queried person, we may be able to pick up the salient regions, but will probably pick up many false positives from the background as well.

In particular, our algorithm finds the queried person by testing a binary classifier in sliding windows throughout the image, spatially and across scale levels, followed by non-maximum suppression to remove multiple responses of the classifier on the same individual at slightly shifted spatial locations and neighboring scales. This binary classifier is trained to verify whether the queried person is present in the current window and consists of two components: 1) a generative model representing the commonness of people, using the average probabilistic distribution of the training data; 2) a Fisher vector feature [10] representing the uniqueness of a person, using the learned generative model as the kernel. The final verification is based on a fusion of the commonness of the human body (to reduce false positives) and the uniqueness of the queried person (to reduce false negatives).

The main contributions are: 1) We argue that the simple combination of current solutions to the related problems (e.g., people detection and person re-identification) does not straightforwardly solve the new person search problem. 2) We propose a unified framework for person search by jointly modeling people commonness and person uniqueness, and show its superiority with experiments.

2. PROPOSED FRAMEWORK
In this section, we introduce a unified framework for person search. Given a query image Iq and a scene Is, we want to find the possible location L of the queried person. Since we know we are looking for a person, who belongs to the category of people, we can also make use of an additional set of images of people, T. The problem can be solved by maximizing a posterior

P(L | Iq, Is, T) ∝ P(L | Is, T) · P(L | Iq, Is)
               ∝ e^{−Dcom(L, Is, T)} · e^{−Duni(L, Iq, Is)},   (1)

since T is independent of Iq, and assuming a uniform prior for L. Dcom(·) and Duni(·) are two distance functions measuring commonness (Sec. 2.2) and uniqueness (Sec. 2.3), respectively, based on appearance computed locally (Sec. 2.1).
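As a minimal sketch (not the authors' code), the decision implied by Eqn. (1) reduces to thresholding the product of two exponential terms; the function names and the threshold `tau` below are illustrative, with `tau` standing for the threshold learned from training data mentioned in Sec. 2.3:

```python
import math

# Sketch of the fused decision in Eqn. (1). d_com and d_uni are assumed to
# be the precomputed distances of Eqns. (4) and (6); tau is the learned
# decision threshold. All names here are illustrative assumptions.
def search_score(d_com: float, d_uni: float) -> float:
    """Unnormalized posterior P(L | Iq, Is, T) of Eqn. (1)."""
    return math.exp(-d_com) * math.exp(-d_uni)

def contains_queried_person(d_com: float, d_uni: float, tau: float) -> bool:
    return search_score(d_com, d_uni) > tau
```

Because both terms are monotone in the distances, thresholding the product is equivalent to thresholding the sum Dcom + Duni.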

2.1 Local Descriptor
To take advantage of the common spatial configuration of human bodies (e.g., mostly standing upright, often appearing symmetric) without using sophisticated part matching algorithms, we describe the entire window with six smaller rectangles roughly corresponding to the six human body parts (i.e., head, torso, left and right arms, left and right legs), as shown in Fig. 1(a). The template is empirically derived from the average image of the training set. Feature extraction and model training are performed in each part separately. For clarity, we limit the following discussion to one part. The complete feature or model is a concatenation of the part features or models.

In order to capture the structure, color, gradient and orientation information, we use a very concise 7-dimensional low-level descriptor for each pixel [8]:

f(x, y) = [ x, y, I(x, y), ∂I/∂x, ∂I/∂y, ∂²I/∂x², ∂²I/∂y² ],   (2)

where x and y are the pixel coordinates relative to the part, I(x, y) is the pixel intensity at position (x, y), ∂I/∂x and ∂I/∂y are the first-order derivatives of image I with respect to the horizontal and vertical directions, and ∂²I/∂x² and ∂²I/∂y² are the second-order derivatives. The descriptors are further whitened in each part. For color images, we convert the color space to HSV and extract such descriptors in each color channel separately.
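The descriptor of Eqn. (2) can be sketched as follows for one single-channel part image; using `np.gradient` as the derivative operator is an assumption, since the paper does not specify the finite-difference scheme, and the per-part whitening step is omitted:

```python
import numpy as np

# Sketch of the 7-D per-pixel descriptor of Eqn. (2) for a single-channel
# part image. np.gradient is an assumed derivative operator; whitening
# (mentioned in the text) is omitted for brevity.
def pixel_descriptors(part: np.ndarray) -> np.ndarray:
    """Return an (H*W, 7) array of [x, y, I, dI/dx, dI/dy, d2I/dx2, d2I/dy2]."""
    h, w = part.shape
    ys, xs = np.mgrid[0:h, 0:w]               # coordinates relative to the part
    iy, ix = np.gradient(part.astype(float))  # first-order derivatives
    iyy = np.gradient(iy, axis=0)             # second-order derivatives
    ixx = np.gradient(ix, axis=1)
    feats = np.stack([xs, ys, part, ix, iy, ixx, iyy], axis=-1)
    return feats.reshape(-1, 7)
```

For an HSV image, this would be applied to each channel separately and the results concatenated, as the text describes.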

2.2 GMM Encoded Commonness
We use a generative model, in particular a Gaussian Mixture Model (GMM), to capture the commonness of shape and appearance of human bodies. Given the training images, we learn a GMM for each part using the extracted local descriptors. The learned model is denoted by Θ = {(μk, σk, πk) : k = 1, …, K}, where μk, σk and πk are the mean, covariance and prior probability of the k-th Gaussian component, respectively. In our implementation, K is empirically set to 16 for each body part and σk is diagonal [8].

Since we train the GMM with various images of different people, we consider the trained GMM a representation of people commonness. In essence, the learned Gaussian components can be regarded as the building blocks of the shape and appearance of human bodies, while the prior distribution Π = [π1, …, πK]⊤ describes how the common human body is constructed from these building blocks. We can compute the posterior probability wik of a local descriptor fi being generated by the k-th Gaussian component

wik = p(k | fi; μk, σk) = πk N(fi; μk, σk) / Σ_{j=1}^{K} πj N(fi; μj, σj).   (3)

The distance function Dcom(·) is then defined as

Dcom(L, Is, T) = ‖ Σ_{i∈idx(Is(L))} ( [wi1, …, wiK]⊤ − Π ) ‖₂,   (4)

where idx(Is(L)) is the set of indices within the window Is(L). Intuitively, if the posterior of the different Gaussian components deviates far from the prior mixture weights, the contents of the window are less likely to resemble a human body. We use this measure to eliminate false positives that do not contain or do not well localize a human body.
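Eqns. (3) and (4) can be sketched as follows; a scikit-learn `GaussianMixture` with diagonal covariances is assumed as a stand-in for the learned model Θ (the paper does not name an implementation), and `predict_proba` supplies the posteriors wik of Eqn. (3):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch of the commonness distance of Eqn. (4). The scikit-learn
# GaussianMixture is an assumed stand-in for the learned GMM Theta
# (K = 16 per part in the paper, diagonal covariances).
def d_com(gmm: GaussianMixture, window_feats: np.ndarray) -> float:
    w = gmm.predict_proba(window_feats)   # (N, K) posteriors w_ik, Eqn. (3)
    pi = gmm.weights_                     # prior mixture weights Pi
    return float(np.linalg.norm((w - pi).sum(axis=0)))
```

A window whose descriptor posteriors aggregate close to Π yields a small distance, i.e., it "looks like" an average human body part.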

2.3 Fisher Vector Encoded Uniqueness
The Fisher vector [10] is an image representation usually used in visual classification, and it has seen success in person re-identification. Hereby, we incorporate it with the learned GMM to describe the uniqueness of a specific person's appearance. For a given window W, we compute the mean and covariance deviation vectors uk and vk for each of the K components of the Gaussian mixture model, and the Fisher vector is the concatenation of the deviation vectors uk and vk, i.e., Φ(W) = [u1, v1, …, uK, vK]⊤, where

uk = (1 / (|W| √πk)) Σ_{i∈idx(W)} wik (fi − μk) / σk,

vk = (1 / (|W| √(2πk))) Σ_{i∈idx(W)} wik [ ((fi − μk) / σk)² − 1 ].   (5)

The distance function Duni(·) is then defined as

Duni(L, Iq, Is) = ‖ Φ(Is(L)) − Φ(Iq) ‖₂.   (6)

The final decision of whether a candidate window contains the queried person is made by jointly measuring the commonness and uniqueness (Eqn. 1) with a proper threshold learned from the training data.
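Eqns. (5) and (6) can be sketched as follows; again a diagonal-covariance scikit-learn `GaussianMixture` is an assumed stand-in for the learned GMM, not the authors' implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch of the Fisher vector of Eqn. (5) and the uniqueness distance of
# Eqn. (6). covariance_type="diag" is required so that gmm.covariances_
# has shape (K, D), matching the diagonal sigma_k of the paper.
def fisher_vector(gmm: GaussianMixture, feats: np.ndarray) -> np.ndarray:
    n = feats.shape[0]                                  # |W|
    w = gmm.predict_proba(feats)                        # posteriors w_ik, (N, K)
    mu = gmm.means_
    sigma = np.sqrt(gmm.covariances_)                   # diagonal std devs, (K, D)
    pi = gmm.weights_
    parts = []
    for k in range(gmm.n_components):
        z = (feats - mu[k]) / sigma[k]                  # standardized residuals
        u_k = (w[:, k:k + 1] * z).sum(0) / (n * np.sqrt(pi[k]))
        v_k = (w[:, k:k + 1] * (z ** 2 - 1)).sum(0) / (n * np.sqrt(2 * pi[k]))
        parts += [u_k, v_k]
    return np.concatenate(parts)

def d_uni(gmm, query_feats, window_feats):
    return float(np.linalg.norm(fisher_vector(gmm, window_feats)
                                - fisher_vector(gmm, query_feats)))
```

The resulting vector has 2KD dimensions per part; with K = 16 and the 7-D descriptor of Eqn. (2), that is 224 dimensions per color channel and part.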

2.4 Searching Strategy
The entire framework follows the sliding window strategy. For each candidate window, a binary decision of whether it includes the queried person is made. We further perform multi-scale scanning to handle scale variations and non-maximum suppression to remove redundant responses. One different strategy employed in our framework is that the redundant responses are pruned using additional identity information: if two overlapping windows are identified as two different subjects, they will not be merged, whereas in the original protocol for detection, responses for all people are put together for pruning and hence occluded people can be incorrectly merged with the person in front. This way we can effectively reduce false negatives.
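The identity-aware pruning described above can be sketched as a greedy NMS that only suppresses overlapping responses carrying the same identity; the `(box, identity, score)` tuple layout is an assumption for illustration:

```python
# Sketch of the identity-aware non-maximum suppression of Sec. 2.4:
# overlapping responses are merged only when they share an identity, so an
# occluded person is not suppressed by the person in front. Boxes are
# (x, y, w, h); the tuple layout and threshold are assumptions.
def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / float(aw * ah + bw * bh - inter)

def identity_aware_nms(dets, thresh=0.5):
    """dets: list of (box, identity, score); greedy NMS within each identity."""
    kept = []
    for box, ident, score in sorted(dets, key=lambda d: -d[2]):
        if all(k_id != ident or iou(box, k_box) < thresh
               for k_box, k_id, _ in kept):
            kept.append((box, ident, score))
    return kept
```

In standard detection NMS the `k_id != ident` test would be absent, which is exactly how an occluded person's response gets merged into the front person's.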

3. EXPERIMENTS
In this section, we show our experimental results and comparisons to other approaches on public benchmarks.

3.1 Datasets and Settings
We validate our method on two public datasets: CAMPUS-Human [11] and EPFL [4], as shown in Tab. 1. The groundtruth ID and bounding box of each target person are provided in both datasets. Other public Re-ID datasets are unsuitable for our task because they lack scene images.

Table 1: Dataset information

Dataset   ID   Query   Scene   Target   Scene Res.
CAMPUS    74   370     214     1519     640×360
EPFL      30   70      80      294      360×288

To better evaluate our method, we separate the CAMPUS-Human dataset into two different scenarios: a crowded scenario S1 with more than 10 people in one shot, and a sparse scenario S2 with fewer than 10 people. For both datasets, we use all query images belonging to the same person for search, scanning every scene shot to locate the queried person. All parameters are fixed in the experiments, including the scanning scale factor 1.1, window size 60×180, and window stride 8 (in both horizontal and vertical directions).
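With the stated parameters, the scanning grid can be sketched as follows; generating windows over an image pyramid is an assumed implementation detail, since the paper does not describe the pyramid layout:

```python
# Sketch of the multi-scale sliding-window grid of Sec. 3.1 with the stated
# parameters: base window 60x180, stride 8, scale factor 1.1. Scanning a
# rescaled image pyramid is an assumption; (x, y) are pyramid coordinates.
def sliding_windows(img_w, img_h, win_w=60, win_h=180, stride=8, scale=1.1):
    """Yield (x, y, s): window top-left corner at pyramid scale factor s."""
    s = 1.0
    while img_w / s >= win_w and img_h / s >= win_h:
        w, h = int(img_w / s), int(img_h / s)
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                yield (x, y, s)
        s *= scale
```

For a 640×360 CAMPUS-Human frame this already yields thousands of candidate windows per scale, which is the computational cost the conclusion refers to.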

3.2 Experimental Results
To objectively evaluate our approach, we set up two baselines for comparison: 1) sequentially applying HOG-based people detection [1] and person re-identification using Fisher vectors [8] (Detect+ID Seq); 2) combining these two techniques more tightly by fusing the detection score and the identification difference (Detect+ID Jnt), where the fusion weight is found by cross validation and the fused score is set as 0.25 (4 − DetectScore) + IDDifference. Besides, we analyze the effectiveness of the components of the proposed method with two settings: 1) using only uniqueness, as in the general image search task (Our Uniq); 2) using both commonness and uniqueness (Our Full).


Figure 2: Performance comparisons using PR curves on the CAMPUS-Human All, CAMPUS-Human S1, CAMPUS-Human S2 and EPFL datasets.

Figure 3: Person search results generated by our method. Given the query persons, green and red rectangles in the scene images mark correct and false locations, respectively; blue rectangles are the groundtruth.

Table 2: Quantitative results (F-score) on two public datasets.

Method          CAMPUS-Human               EPFL
                All      S1       S2
Our Full        16.28%   19.96%   27.55%   30.15%
Our Uniq        13.59%   17.65%   24.47%   27.87%
Detect+ID Jnt   12.15%   16.54%   24.12%   29.72%
Detect+ID Seq   11.51%   15.82%   23.96%   29.54%

We adopt the PASCAL Challenge criterion to evaluate the localization results: a match is counted as correct only if the intersection-over-union ratio (IoU) with the groundtruth bounding box is greater than 50%. We utilize the F-score as the benchmark metric, which measures the accuracy of person search by considering both recall and precision. The F-scores and PR curves of all experiments on both datasets are reported quantitatively in Table 2 and Fig. 2, and shown qualitatively in Fig. 3.

From the results, we conclude that our unified framework outperforms the other search strategies in general. Detect+ID Jnt works better than Detect+ID Seq, supporting our advocacy of jointly modeling detection and identification. However, because the detection score and identification score are two naturally different metrics, the direct combination of such scores does not work as well as our fundamentally coupled framework (Our Full). The improvements of Our Full over the baselines are smaller on the EPFL dataset because this dataset is relatively easy for the detectors.
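The evaluation protocol can be sketched as follows; the one-to-one matching bookkeeping of the full PASCAL protocol is simplified here, so this is an illustration rather than the official scorer:

```python
# Sketch of the evaluation: a detection is correct under the PASCAL
# criterion when its IoU with a groundtruth box exceeds 0.5; the F-score
# combines the resulting precision and recall. Boxes are (x, y, w, h);
# one-to-one matching of detections to groundtruth is simplified.
def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / float(aw * ah + bw * bh - inter)

def f_score(dets, gts, thr=0.5):
    tp = sum(1 for d in dets if any(iou(d, g) > thr for g in gts))
    precision = tp / len(dets) if dets else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```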

4. CONCLUSIONS
In this paper, we propose a unified framework for person search, a practical multimedia search problem, by jointly modeling the commonness of people and the uniqueness of the queried person. Efficiency is still a concern due to the high computational cost of each sliding window. We are exploring state-of-the-art region proposal methods to replace the sliding window protocol, as well as more efficient approaches to model commonness and uniqueness.

5. REFERENCES
[1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005.
[2] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 2012.
[3] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. CVPR, 2010.
[4] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
[5] F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. TPAMI, 2011.
[6] D. Gray and H. Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. ECCV, 2008.
[7] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. CVPR, 2008.
[8] W. Li and X. Wang. Locally aligned feature transforms across views. CVPR, 2013.
[9] B. Ma, Y. Su, and F. Jurie. Local descriptors encoded by Fisher vectors for person re-identification. ECCV Workshops, 2012.
[10] Y. Rui, T. S. Huang, and S.-F. Chang. Image retrieval: Current techniques, promising directions, and open issues. J. Vis. Commun. Image Represent., 10(1):39–62, 1999.
[11] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 2013.
[12] Y. Xu, L. Lin, W. Zheng, and X. Liu. Human re-identification by matching compositional template with cluster sampling. ICCV, 2013.
[13] R. Zhao, W. Ouyang, and X. Wang. Unsupervised salience learning for person re-identification. CVPR, 2013.
[14] W. Zheng, S. Gong, and T. Xiang. Re-identification by relative distance comparison. TPAMI, 2013.