
Transcript of [IEEE 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and...

Robust Background Model for Pixel Based People Counting using a Single Uncalibrated Camera

Saad Choudri, James M. Ferryman, Atta Badii
School of Systems Engineering, University of Reading
[email protected], {j.m.ferryman, atta.badii}@reading.ac.uk

Abstract

Several pixel-based people counting methods have been developed over the years. Among these, the product of scale-weighted pixel sums and a linear correlation coefficient is a popular people counting approach. However, most approaches have paid little attention to resolving the true background and instead take all foreground pixels into account. With large crowds moving at varying speeds, and with the presence of other moving objects such as vehicles, this approach is prone to problems. In this paper we present a method which concentrates on determining the true foreground, i.e. human-image pixels only. To do this we have proposed, implemented and comparatively evaluated a human detection layer to make people counting more robust in the presence of noise and a lack of empty background sequences. We show the effect of combining human detection with a pixel-map based algorithm to i) count only human-classified pixels and ii) prevent foreground pixels belonging to humans from being absorbed into the background model. We evaluate the performance of this approach on the PETS 2009 dataset using various configurations of the proposed methods. Our evaluation demonstrates that the basic benchmark method we implemented can achieve an accuracy of up to 87% on sequence “S1.L1 13-57 View 001”, and our proposed approach can achieve up to 82% on sequence “S1.L3 14-33 View 001”, where the crowd stops and the benchmark accuracy falls to 64%.

Keywords---people counting, crowd analysis, human detection, background modeling.

1. Introduction

Counting objects is important for safety, security and analysis applications. Infra-red sensors, thermal imaging technology and computer vision are alternatives to manually counting objects. Infra-red sensors, e.g. line-of-sight sensors, suffer from limited installation flexibility and coverage, while thermal imaging is expensive; thus computer vision offers the most practical way for object counting. Over the last two decades significant progress has been made in this application domain, e.g. for counting vehicles and people in public areas, roads, malls etc. Section 2 includes a review of the state-of-the-art and the three main methodologies used for the automated counting of people. Section 3 describes our motivation for the developed approach. In Section 4 we explain our method and in Section 5 we evaluate it using 8 different combinations of the proposed techniques and establish the optimal solution.

2. Previous Work

People counting techniques are based on: i) object detection (recognition), ii) tracking, and/or iii) mapping (textures, pixels etc.).

i) Detection-based methods are useful for rapid deployment to both count and localise objects. Such techniques can be used with moving cameras, such as in a PTZ-controlled camera network. When provided with training examples containing partial samples of an object, they perform better than pixel counting or trajectory clustering under occlusion. A well-established method for object detection was proposed by Viola and Jones in [7]. In [8] this method was used to detect faces for head pose estimation of people in crowds, to identify areas of interest by counting identified poses. Dalal and Triggs [9] show how Histograms of Oriented Gradients (HOG) can be used to detect humans. After image segmentation using Mosaic Image Difference (MID), Li et al. [10] use the HOG descriptor to model the omega shape to detect heads. In contrast, Mohan et al. [20] proposed a component-based people detection system where parts of the human body were classified separately and then the results from SVMs were combined. Detection typically requires acquiring thousands of positive and negative samples to train a classifier. In the case of Li et al. [10], 1755 positive samples of 32x32 pixels were collected and 11000 negative randomly sampled patches were obtained from 399 head-shoulder free images. This is generally a cumbersome process and becomes view-specific depending on the samples used for training. However, theoretically, detection provides a linear correlation between heads and the actual number of people.

ii) There are numerous tracking approaches documented and any one of these can be used to estimate the number of independently moving people in a scene. These methods include grouping trajectories of features tracked over time or obtaining the number of tracked blobs. With the latter method it is necessary to have change detection or background segmentation in place.


To accurately detect people, blob splitting and merging is important, especially for analysis of images of crowded scenes. In [11] individuals are tracked using grouped feature trajectories. In [12] floor fields are used to improve the tracking result in a densely crowded area. As trackers these are useful methods but for counting people they can be inaccurate. In a situation where the crowd stops or moves with little directional variance, due to congestion, clustering trajectories will fail. A large amount of the time spent developing this type of algorithm would be devoted to identifying robust cues for tracking, including blob splitting techniques and trajectory grouping parameters.

iii) Mapping-based methods can be found in ADVISOR [3] and PRISMATICA [4] which are among early crowd analysis contributions. They usually rely on features or pixels, and sometimes both. To detect 4 levels of congestion, as described in [5], a neural network was trained with pixels and features from exemplar training data. Kong et al. [13] provide a survey of feature-based mapping methods and use a depth map to normalise HOG descriptors filtered by a foreground map. Then a neural network is trained as it generalises well over non-linearities in the pattern space, and linearity is considered lost in dense crowds. However this linear relationship is assumed in [6] and works well for camera positions significantly higher than the crowd, presenting a near overhead view. In [14] the algorithm integrates a depth map with the segmented foreground, similar to [15], and time-adaptive histograms are used to measure densities at different times. This helps to define a crowdedness threshold per timeslot throughout the day. In the case of a monocular un-calibrated view, depth modelling is a crucial step. In [5, 6, 13, 14, 15] a depth map has been applied to normalise the processing of objects at different distances from the camera. There are several ways whereby a depth map can be obtained, e.g. by using the ground plane assumption in [19] and back-projecting rays from image coordinates to device coordinates in a calibrated view. Although this method can be usefully applied here, if camera parameters are not available more conventional monocular methods can be deployed. Ma et al. in [14] suggested a depth calculation method, said to improve on approaches presented in [5, 6, 15].

Figure 1 Left: ‘S1.L2 Time 14-33 View 001’ Frame 309 from [1]. Centre: poor foreground segmentation as the crowd is absorbed into the background model. Right: applying our human presence mask prevents absorption of important human-pixels.

3. Motivation

Figure 2 Process flow of our approach. [Diagram blocks: Image Segmentation, Depth Map, Human Presence Map, Selective Update with Human Mask, People Count.]

We intend to harness the advantages of both detection and mapping based methods. Most methods such as [5, 6, 13, 14, 15] lack an object detection step to filter out spurious objects from the counting process. Although [5] and [13] use classifiers with features to detect crowd levels they have not explicitly attempted to separate humans from other objects, particularly in complex scenes, e.g. images with vehicles in the scene. Additionally, adaptive background models such as mixture of Gaussians [16], or a single Gaussian per channel called the Colour Mean and Variance (CMV) method in [17], have tended to lead to pixels belonging to humans being absorbed into the background (see Figure 1). This was also noted in [10]. Alternatively a multi-layered approach will eventually perform similarly as absorption is defined by a time threshold. In this study we set out to evaluate the conventional scale-weighted pixel counting method and to make it more robust by using human region detection. Texture is heavily reliant on image resolution; thus it is not used in this work. Our method is motivated by the work of [6] and [14] where we use scale-weighted (using a depth map) pixel counting to approximate the number of people in an image. Unlike other methods we classify foreground pixels before counting them and only update pixels that are not of interest into the background. Our approach is motivated mainly by the need to:

• Increase the robustness of pixel-based people counting methods when background pixels become increasingly difficult to resolve.
• Count human-classified pixels rather than just foreground pixels.
• Reduce the loss of people when they get absorbed into the background after being stationary for a long period, by introducing a selective background update method.

4. Method

Our approach comprises 5 components as depicted in Figure 2. For the sake of clarity the following description does not follow the process flow in Figure 2. In Section 4.1 we begin by describing a human detection layer which, together with “Image Segmentation” (Section 4.2) and the “Depth Map” (Section 4.4), is a key element of the “Human Presence Map” (Section 4.3). In Section 4.5 we explain the two possibilities of counting people depending on scene complexity.
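For orientation, the sketch below composes the pipeline of Figure 2 in Python. It is our illustration, not the authors' code; the helper functions (filter_detections, human_presence_map, count_people) and the CMV model class are the sketches given alongside Sections 4.1-4.5 below, and all names are our assumptions.

    import cv2

    def process_frame(frame_bgr, model, cascade, depth, roi_depth, corr):
        # One pass of the Figure 2 pipeline (a sketch; helpers are defined
        # in the code sketches accompanying the corresponding sections).
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        fg = model.classify(hsv)                                 # Section 4.2
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        boxes = cascade.detectMultiScale(gray)                   # Section 4.1
        boxes = filter_detections(boxes, depth, fg, roi_depth)   # Section 4.1.2
        hp = human_presence_map(fg, boxes)                       # Section 4.3
        model.update(hsv, fg, hp)                                # selective update
        return count_people(hp, depth, corr)                     # Section 4.5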

4.1. Human Region Detection

We demonstrate how a basic human region detector can be implemented with few samples for pixel classification. Heads were chosen as features for human detection. This is because in crowded situations heads can be seen more often from near-overhead views. We are not interested in the result of a head detector but in that the regions that are detected contain at least parts of humans. The detector is used only as a filter and is a building block of the system rather than the proposed system itself. This becomes clearer in Section 4.3.

Figure 3 Exemplar negative samples (top row) and positive samples (bottom row).

4.1.1. Training

As shown in Figure 3, negative and positive samples were extracted from 9 images belonging to 'S0.RF' (other than from Time 13-57 View 001). Up to 296 head-shoulder patches were selected, including shoulders and partial faces. These patches ranged from 15x15 pixels to 30x30 pixels depending on the location of the head-shoulder objects within each image. By scanning through the image, 494 patches were extracted from one empty background image taken from View 001's background images folder as negative samples. While it is well known that more negative samples from different views and datasets would make the system more generalisable, a real life scene has too many varying objects/props and so we treat this step much like learning the background model of a particular scene for change detection. A cascade of boosted classifiers working with Haar-like features is trained based on the method by Viola and Jones [7]. 23 cascades were trained for this purpose. The object detection suite from OpenCV [18] was used for training and detection, where each patch was re-sized to 25x25 pixels.

4.1.2. Filter Levels

Two filters were applied to the detection results. These filtering steps were: i) compare the total weight in the head detection bounding box region against a predefined Region of Interest (RoI) depth from the depth-map (see Section 4.4) selected at initialisation, and ii) filter out remaining heads based on the criterion that head detection bounding boxes include foreground pixels. The RoI depth is calculated by selecting a region as large as a human and calculating the total weight from the depth map for this region. The reason a head is not selected is that we are mainly interested in decreasing false positive human region detections rather than increasing true positive head detections.
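As an illustration of running such a trained cascade, a minimal OpenCV sketch follows; the cascade and image file names and the detectMultiScale parameters are our assumptions, not values from the paper.

    import cv2

    # Hypothetical file names; the paper trains its own head-shoulder
    # cascade with OpenCV's cascade training tools (Section 4.1.1).
    cascade = cv2.CascadeClassifier("head_shoulder_cascade.xml")
    frame = cv2.imread("frame_0012.jpg")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Raw (unfiltered) detections; parameters are illustrative only.
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3,
                                     minSize=(15, 15), maxSize=(60, 60))
    for (x, y, w, h) in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)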

In Figure 4 the 3-filter-level detection results are shown, i.e. raw detection results with the two filters as described. These filters can be used anywhere and are not bound to the training data nor to this task, i.e. instead of using a pixel block size, actual real-world depth can be used to change block sizes.

Figure 4 Detection results after evaluation of every 10 frames of ‘S1.L1 Time 13-57 View 001’ starting from frame 12. As explained in Section 4.1 we are interested only in the ‘Human Region Detector’ precision, which is 90% accurate, and not the others. [Chart plots precision and hit rate (0-100%) and true/false detection counts (0-350 heads) against filter levels 1-3 for Head Detection and the Human Region Detector.]

Figure 5 (Blue) Raw detector results, (Green) weight-filtered result (Level 1), (Red) 2nd level filtered. ‘S1.L1 Time 13-57 View 001’, Frame 0191.

4.1.3. Testing

In Figure 4 filter level 1 is the raw detection result and levels 2 and 3 are the filter step results. Again we are interested in the human region detector, for which the results are 90% correct. It can be seen that the detection results are improved by applying these filters; further, there is a reduction in the number of false positives as a result of these filters whilst the number of true positives remains the same. More true positives are not detected as no attempt is made to improve the head-feature detector in these steps for this purpose. We have shown that for the intended purpose of detecting human regions the results move from below 80% to above 90% after applying the 2nd filter.
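A sketch of the two filter levels in Python follows; the function and variable names are ours, and roi_depth is the human-sized reference weight described in Section 4.1.2.

    def filter_detections(boxes, depth_map, fg_mask, roi_depth):
        """Sketch of the two filter levels of Section 4.1.2 (names are ours).
        boxes: iterable of (x, y, w, h) head detections.
        depth_map: per-pixel scale weights (Section 4.4).
        fg_mask: binary foreground map from the CMV segmentation (Section 4.2).
        roi_depth: total depth-map weight of a human-sized region selected
        at initialisation.
        """
        level1 = []
        for (x, y, w, h) in boxes:
            # Filter i): reject boxes whose total depth-map weight exceeds the
            # reference RoI depth (i.e. boxes larger than a human).
            if depth_map[y:y + h, x:x + w].sum() <= roi_depth:
                level1.append((x, y, w, h))
        level2 = []
        for (x, y, w, h) in level1:
            # Filter ii): keep boxes that contain at least one foreground pixel.
            if fg_mask[y:y + h, x:x + w].any():
                level2.append((x, y, w, h))
        return level2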

Figure 6 Frame 147 ‘S0.RF-Time 1537-View 001’. Foreground pixels (F) are shown as white and Background (B) is black. Shadows and highlights are shown as shades of grey.

4.2. Image Segmentation

We chose the CMV method [17] and applied it to the Hue Saturation Value (HSV) colour space. HSV is well suited to shadow and highlight handling where the difference of a few frames per second is not crucial and the data is not noisy, such as in this case. The following equations are used to build each Gaussian during the background building and updating process:

(1) μ_t(k) = (1 − p)·μ_{t−1}(k) + p·l_t(k)
(2) σ_t²(k) = (1 − p)·σ_{t−1}²(k) + p·(l_t(k) − μ_t(k))²

where k is the pixel location, i.e. k = (row, column, channel) and channel = (0|1|2) for H, S and V respectively, t is the current frame, p is the update weight, l is the new pixel value, μ is the mean and σ is the standard deviation. Hence at each HSV pixel (L) there are 3 Gaussians.

During background building the weight p is 1/N, where N is the number of frames processed up to that point. After this the weight p for background update is set to 0.1 and 0.2 for the mean and standard deviation respectively. If a pixel index is considered as foreground the update for this is set to 0.01 and 0.05 for the mean and standard deviation respectively. Updates occur every 5 frames. For the H and S channels the decision making is in Equation (3), where F_k and B_k are the foreground and background verdicts for channel k respectively:

(3) verdict_t(k) = F_k if |l_t(k) − μ_{t−1}(k)| > λ·σ_{t−1}(k), else B_k, for k ∈ {H, S}

F and B represent the final foreground or background verdict for the 3D image pixel L respectively. FG is the foreground/background map. An intersection (& operation) of the verdicts of the above operation is taken before checking if the pixel is a shadow or highlight. For example, if the Hue channel ranks this pixel as being a background rather than a foreground then no further checks are made. Otherwise it is defined as a probable foreground pixel P:

(4) P(L) = F if verdict_t(H) = F_H & verdict_t(S) = F_S, else B

Figure 7 With the update counter set to 1 (every frame) the background model frames were stored for the entire sequence. Without a human mask for update purposes (top-right), people get absorbed into the mean and standard deviation of the model more than in the case of the bottom-right image, where the selective update is used.

4.2.1 Shadows and Highlights

Shadows and highlights are handled within the V channel (see Figure 6). If the value of a pixel is considered as foreground by all channels above then the next steps are performed; see Equations (5) and (6). If this value is i) less than the mean, ii) deviates from the mean by less than 50, and iii) is greater than 0, it is considered a shadow. For classifying pixels as highlights, the same configuration is used but the distinction is that the pixel value must be greater than the mean.

(5) Shadow(L) = 1 if 0 < μ_t(k) − l_t(k) < 50
(6) Highlight(L) = 1 if 0 < l_t(k) − μ_t(k) < 50
(7) verdict(L) = F if P(L) = F & Shadow(L) = 0 & Highlight(L) = 0, else B

For Equations (5) to (7) k is 2, i.e. it is set to channel V.

(8) FG(L) = white if verdict(L) = F, else black
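To make the update rules concrete, here is a minimal NumPy sketch of the CMV model with the selective human-mask update of Section 4.3 folded in; the class name and the decision threshold (lam) are our assumptions, and the V-channel shadow/highlight tests (Eqs. (5)-(7)) are omitted for brevity.

    import numpy as np

    class CMVBackground:
        """Sketch of the per-channel Gaussian (CMV) model of Section 4.2.
        One mean/variance pair per HSV channel per pixel; the update weights
        follow the text (0.1/0.2 background, 0.01/0.05 foreground)."""

        def __init__(self, frames_hsv):
            stack = np.stack(frames_hsv).astype(np.float32)  # N building frames
            self.mu = stack.mean(axis=0)                     # Eq. (1), p = 1/N
            self.var = stack.var(axis=0) + 1e-6              # Eq. (2), p = 1/N

        def classify(self, frame_hsv, lam=2.5):
            """Per-channel verdicts (Eq. (3)); lam is an assumed threshold.
            Eq. (4): intersect the H and S verdicts for probable foreground."""
            d = np.abs(frame_hsv.astype(np.float32) - self.mu)
            fg_ch = d > lam * np.sqrt(self.var)
            return fg_ch[..., 0] & fg_ch[..., 1]

        def update(self, frame_hsv, fg_mask, human_mask):
            """Selective update (the paper applies it every 5 frames):
            background pixels adapt fast, foreground pixels slowly, and
            pixels under the human presence mask are frozen (Section 4.3)."""
            f = frame_hsv.astype(np.float32)
            p_mu = np.where(fg_mask, 0.01, 0.1)[..., None]
            p_var = np.where(fg_mask, 0.05, 0.2)[..., None]
            freeze = human_mask[..., None]          # do not absorb humans
            p_mu = np.where(freeze, 0.0, p_mu)
            p_var = np.where(freeze, 0.0, p_var)
            self.mu = (1 - p_mu) * self.mu + p_mu * f                      # Eq. (1)
            self.var = (1 - p_var) * self.var + p_var * (f - self.mu) ** 2 # Eq. (2)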

No further filters, e.g. a median filter, are applied, as kernel sizes vary depending on the size of the object of interest from one scene to another. Also no connected component analysis is done at this stage and the segmentation is used as it is to count the white foreground pixels. No tweaking of any other parameters and thresholds is done; this was to keep the implementation and testing of this method at its simplest, as an effective approach for people counting. This would help the scalability of this approach.

Figure 8 Evaluation results after processing “S1.L1 Time 13-57 View 001” and “S1.L2 Time 14-33 View 001”. The mean precision over R0, R1 and R2 for each method is shown along with the precision deviation over these regions, to test for both accuracy and consistency. As shown, a human-RoI mask proved beneficial where the crowd stops (“S1.L2 Time 14-33 View 001”). Also see Figure 1.

4.3. Human Presence Map

In Equation (9), x and y are the top left bounding box corner points of the human region detection box, and w and h are the width and height respectively.

(9) HP(i, j) = white for all x ≤ i ≤ x + w, y ≤ j ≤ y + h with FG(i, j) = white
(10) HP(m, n) = white for every foreground pixel (m, n) connected, within a SIZE-pixel window, to a pixel already set white in HP

The detector was applied to frames primarily to filter out noise and to make the system more robust to scenarios where there is a lack of empty background images. Based on the human region detection method described in our approach, a connected component analysis is done between the foreground pixels and the 2nd level filtered head detection bounding boxes; see Equations (9) and (10). That is, those head detections which are smaller than the defined RoI depth and encompass a white pixel region are used as acceptance regions for the foreground mask.

Figure 9 a) Selection of two lines (green) equally sized along the ground plane, b) image segmentation result, c) resulting depth map, d) scale-weighted pixels. This dataset [2] was used for illustrating changing depth.

With a window size of 25x25 pixels (i.e. SIZE = 625 in Equation (10)), connected component analysis is performed to connect the encoded components to surrounding white pixels. Those components that are not connected to any human region are discarded, resulting in the human presence map. This is intended to: i) obtain the location map of humans, ii) prevent humans from being absorbed into the background model, iii) reduce the number of learning frames required, and iv) filter out any spurious foreground detections. We use the human presence map to: a) prevent people from being absorbed into the background, and b) prevent non-human-pixels from being counted as human-pixels. Thus at this stage we have 2 maps, a foreground map from Equation (8) and a human presence map from Equations (9) and (10). As shown in Figure 10 there is much less noise in the human presence map.

Figure 10 Top: Frame 0043 from ‘S1.L1 Time 13-57 View 001’, Middle: image segmentation result, Bottom: Human Presence RoI.

4.4. Depth Map

This geometric correction (GC) or depth map estimation is used to model the relative depths between pixels in an image, thus providing the scale-weighting mentioned earlier. The method used has been adapted from [14]. The method in Equation (11) gives us the vertical coordinate of the vanishing point Yv. In Equation (13) it is used to calculate Depth_r. This depth map can then be duplicated across the image columns to make it 2D. A constant should be used to account for people standing on the ground. We use the inclination of the object, but this is only useful if density is being calculated directly from the weighted foreground map. Figure 9 ‘a’ illustrates the process during the initialisation stage taking PETS 2006 [2] data, where two lines are selected both of equal length on the ground plane. This dataset has been used to illustrate these steps as there is more depth in this scene than in View 001 of PETS 2009 [1].

(11) Yv = (L2·Y1 − L1·Y2) / (L2 − L1)
(12) d_r = r − Yv
(13) Depth_r = (r_ref − Yv) / (r − Yv)
(14) RoIDepth = Σ_{(r,c) ∈ human-sized region} Depth_r

where Y1 and Y2 are the image rows of the two equal-length line segments, L1 and L2 are their lengths in pixels, r is an image row and r_ref is a reference row. The RoI depth of Equation (14) is the total depth-map weight over a human-sized region selected at initialisation (Section 4.1.2).
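Under the reconstruction above, the two-line initialisation and the per-row weights might be computed as follows; all names are ours and the linear perspective model is the one implied by the equal-length line setup.

    import numpy as np

    def vanishing_row(y1, l1, y2, l2):
        """Eq. (11) as reconstructed: pixel length of an equal-length ground
        segment is proportional to (y - Yv), so l1/(y1 - Yv) = l2/(y2 - Yv)."""
        return (l2 * y1 - l1 * y2) / (l2 - l1)

    def depth_map(height, width, yv, r_ref):
        """Per-row relative weights (Eqs. (12)-(13) as reconstructed),
        duplicated across columns to make the map 2D."""
        rows = np.arange(height, dtype=np.float32)
        d = np.maximum(rows - yv, 1.0)       # Eq. (12): distance below Yv
        depth_r = (r_ref - yv) / d           # Eq. (13): weight relative to r_ref
        return np.tile(depth_r[:, None], (1, width))

    # Eq. (14): RoI depth as the total weight over a human-sized region,
    # e.g. roi_depth = dm[y:y+h, x:x+w].sum() for a box chosen at initialisation.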

4.5. Count

As shown in Equation (15), for counting people in image sequences a correlation coefficient is calculated by taking the number of people present in a region of interest and dividing this by the total weight in the region. In this study, currently all training has been done in region R0. N is the total number of frames used, People_i is the actual number of people in the frame, and WT is the total weight of the region of interest where people are present. In this way a vector of values is obtained and the mean of this vector gives us the correlation coefficient Corr. We use the mean as we are interested in seeing the result with outliers affecting Corr.

(15) Corr = (1/N) · Σ_{i=1..N} (People_i / WT_i)
(16) WT = Σ_x Σ_y Depth(x, y) for all (x, y) with FG(x, y) = white
(17) WT = Σ_x Σ_y Depth(x, y) for all (x, y) with HP(x, y) = white
(18) Count = WT × Corr
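A compact sketch of the training and counting steps (Eqs. (15)-(18)) follows; the function names are ours, and the choice between Eq. (16) and Eq. (17) is made by passing either the raw foreground map or the human presence map as the mask.

    import numpy as np

    def train_corr(weights, people_counts):
        """Eq. (15): mean over the training frames of People_i / WT_i."""
        w = np.asarray(weights, dtype=np.float64)
        p = np.asarray(people_counts, dtype=np.float64)
        return float(np.mean(p / w))

    def count_people(mask, depth, corr):
        """Eqs. (16)-(18): scale-weighted sum of white pixels in the chosen
        map, multiplied by the trained correlation coefficient."""
        wt = depth[mask.astype(bool)].sum()   # Eq. (16) or (17)
        return wt * corr                      # Eq. (18)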

5. Evaluation

Ground truth was obtained by counting the number of people in every 10th frame of ‘S1.L1 Time 13-57 View 001’ and ‘S1.L3 Time 14-33 View 001’. These sequences were chosen as they represent difficulty levels 1 and 3 respectively. Also, in ‘S1.L3 Time 14-33 View 001’ the crowd stops and stands still, permitting the evaluation of the selective background update, which had been set as one of the main objectives to be addressed in this work. The ground truth starts from frame 12 as the first 12 frames are used to estimate the depth, the relative single human depth, and 10 readings of actual human count for the vector in Equation (15). Please refer to Table 1 and Figure 8. Wherever BG is set to “YES”, 25 empty images were selected from time sequence ‘S0 - Background - Time 13-38 - View 001’ to build the background.

5.1. Combinations

Table 1 Method combinations. Method 1 is the developed benchmark approach.

Index  CMV  CMV HUMAN  HUMAN MASK  BG
1      YES  NO         NO          YES
2      YES  NO         YES         NO
3      YES  NO         YES         YES
4      YES  NO         NO          NO
5      NO   YES        NO          YES
6      NO   YES        YES         NO
7      NO   YES        YES         YES
8      NO   YES        NO          NO

Various configurations of the proposed approach and the benchmark approach are presented in Table 1. With reference to this table, we submitted methods 1, 3 and 7 to the PETS2009 organisers for a blind evaluation. Method 3 is designed for a situation where only humans are assumed to traverse a scene, and method 7 caters for a situation where other objects are present in the scene. These methods are designed to handle scenarios where the crowd stops entirely for a prolonged period of time. Other combinations are evaluated for a comparative and contrastive analysis of the resulting performance of the system in varying scenarios, primarily to test its flexibility. The resulting methods are named using these combinations. If the method name starts with just “CMV_” and not “CMV_HUMAN_” then this method is calculating the product of the total number of weighted foreground pixels and the correlation coefficient, i.e. pixels are not being classified before being counted; see Equation (16). In what follows, underscores, i.e. “_”, have been removed from configuration names for readability. This developed method represents the conventional scale-weighted pixel-based people counting method and will be referred to as the vanilla/benchmark approach in this text. If the name starts with “CMV HUMAN” then this method differs from the vanilla method in that the foreground map used is first filtered with the human region mask using connected components before a correlation coefficient is calculated or applied; see Equation (10). Both of these are then followed by “BG”, with “YES BG” denoting that an empty background sequence was provided and “NO BG” indicating that no such sequence was provided. The suffix “HUMAN MASK”, if set to yes, means that the human presence mask was used to prevent human-pixels from being assimilated into the background model, thus speeding up the absorption of non-human pixels.

5.2. Testing

Referring to Figure 8 and Table 1 for this analysis of test results, it is noted that the overall top 2 methods are our proposed methods (in comparison to the benchmark approach) using the human presence mask for the background update routine. As explained earlier, the difference between ‘CMV HUMAN’ and ‘HUMAN MASK’ is that in the case of the former human pixels are considered at every frame, and in the case of the latter the raw foreground mask contains mainly human pixels at every frame as pixels belonging to other objects are absorbed into the background. However the source, i.e. the human presence mask, for both is the same. Combining these two methods reduced the overall precision for ‘S1.L1 13-57’ but proved robust in the case of ‘S1.L3 14-33’, where for the benchmark approach people were absorbed into the background when they stood still. This improvement in the results for the proposed methods was also attributed to people standing close together, thus increasing the likelihood of being classified as human given the low detection rate; and so it is indicative of the potential of the human presence mask for dense slow moving crowds. Methods prefixed with “CMV HUMAN” generally did not perform as well as methods prefixed only with “CMV”. The result of all methods using a human region mask, whether for background update or human pixel counting, can be improved by increasing the detection hit rate.

Figure 11 The relationship between weighted foreground pixels and the actual number of people in regions R0, R1 and R2. Troughs can be seen where human filtering is used, due to the low hit rate. Dataset: ‘S1.L1 13-57’.

In sequence ‘S1.L1 13-57’ the top performing method is the benchmark “CMV” method (86%) where a background sequence is available. In the presence of no background sequence and no human presence mask, it yields a lower average score (66%) than method 2 (see Table 1), “CMV NO BG YES HUMAN MASK” (70%). For both sequences, the top performing method is the benchmark approach combined with a human-region mask for background update purposes, “CMV YES BG YES HUMAN MASK” (76.9%), followed by “CMV HUMAN YES BG YES HUMAN MASK” (76.5%), where we perform the selective update and also count only human-pixels. In the presence of no background sequence and no human region update mask, the benchmark (“CMV”) approach yields 62.6%. We show how combining it with a human presence mask improves it, when no clear background is available, to 66% with “CMV NO BG YES HUMAN MASK”.

In Figure 8 the average standard deviation of precision over R0, R1 and R2 is given along with the mean precision over these regions. This is to test the depth map and show how consistently accurate the approaches can be. We can see that the depth map proved to be accurate. For ‘S1.L3 14-33’ there is inconsistency across the regions for those methods that do not deploy a human-region mask (“NO HUMAN MASK”), as a result of human pixels getting absorbed into the background model, specifically in region (R1 & R0 & ~R2).

5.3. Discussion

This study has helped to resolve problems with updating and counting all foreground pixels. As shown in Figure 1, slow moving crowds tend, over a long period of time, to be absorbed into the background, thus contributing to a poor foreground segmentation result. The background model resulting from a slow moving crowd passing through a scene is shown in Figure 7, both when our update routine addition is used and when it is not. We show that using a smart update procedure improves the background estimation task and makes it more robust in the case of slower moving crowds. Had the dataset contained other objects, e.g. cars, our results would have provided more evidence that counting only human pixels in this situation improves the overall result. From Figure 11 we can see losses in linearity between the weighted pixel count and the actual number of people, owing to the lower detection hit rate. We have shown that our method makes the benchmark system more robust when the difficulty level increases. The methods put forward for the challenge are methods 1, 3 and 7 from Table 1, which correspond to the top 3 performing methods in Figure 8.

6. Conclusion

The results of the configurations tested on ‘S1.L1’, ‘S1.L2’ and ‘S1.L3’ have been made available for the people counting challenge in [1]. In this paper we have shown that classifying foreground pixels makes pixel-based people counting more robust in the presence of slow moving crowds, noise and a lack of empty background sequences. We aim to develop a more robust but still simple-to-initialise human detection layer based on this idea, with a higher detection hit rate. In doing so we also hope to deal with situations where a crowd is stationary from the start to the end of the sequence.

7. References

[1] PETS 2009, http://pets2009.net/
[2] PETS 2006, http://pets2006.net/
[3] M. Naylor. “ADVISOR: Annotated Digital Video for Intelligent Surveillance and Optimised Retrieval”. 2003. [http://www-sop.inria.fr/orion/ADVISOR/finaleval.pdf]
[4] S.A. Velastin, L. Khoudour, B.P.L. Lo, J. Sun and M. Vicencio-Silva. “PRISMATICA: a multi-sensor surveillance system for public transport networks”. In Proc. of 12th Int’l Conf. on Road Transport Information and Control (RTIC), Institute of Electrical Engineers (IEE), UK, pp. 19-25. 2004.
[5] B.P.L. Lo and S.A. Velastin. “Automatic congestion detection system for underground platforms”. In Proc. of the Int’l Symposium on Intelligent Multimedia, Video and Speech Processing, Hong Kong, pp. 158-161. 2001.
[6] A.C. Davies, J.H. Yin and S.A. Velastin. “Crowd monitoring using image processing”. Electronics & Communication Engineering Journal, Vol. 7(1), pp. 37-47. 1995.
[7] P. Viola and M. Jones. “Robust Real-time Object Detection”. In Int’l Journal of Computer Vision. 2001.
[8] S. Choudri. “Cognitive Vision Model Gauging Interest in Advertising Hoardings”. MSc Project, University of Leeds. 2006. [www.comp.leeds.ac.uk/mscproj/reports/0506/Choudri.pdf]
[9] N. Dalal and B. Triggs. “Histograms of oriented gradients for human detection”. In Proc. CVPR, Vol. 1, pp. 886-893. 2005.
[10] M. Li, Z. Zhang, K. Huang and T. Tan. “Estimating the Number of People in Crowded Scenes by MID Based Foreground Segmentation and Head-shoulder Detection”. In Proc. of the 19th IEEE Int’l Conf. on Pattern Recognition, Tampa, Florida, USA, pp. 1-4. 2008.
[11] G.J. Brostow and R. Cipolla. “Unsupervised Bayesian Detection of Independent Motion in Crowds”. In Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Vol. 1, pp. 594-601. 2006.
[12] S. Ali and M. Shah. “Floor Fields for Tracking in High Density Crowd Scenes”. In Proc. of the 10th European Conference on Computer Vision, Marseille, France, Vol. 2, pp. 1-14. 2008.
[13] K. Dan, G. Doug and T. Hai. “Counting Pedestrian in Crowds using viewpoint invariant training”. In Proc. BMVC. 2005.
[14] R. Ma, L. Li, W. Huang and Q. Tian. “On pixel count based crowd density estimation for visual surveillance”. In IEEE Conf. on Cybernetics and Intelligent Systems, Vol. 1, pp. 170-173. 2004.
[15] N. Paragios and V. Ramesh. “A MRF-Based Approach for Real-Time Subway Monitoring”. In CVPR, Vol. 1, pp. I-1034-I-1040. 2001.
[16] C. Stauffer and W.E.L. Grimson. “Adaptive background mixture models for real-time tracking”. In CVPR, Vol. 2, pp. 246-252. 1999.
[17] J. Aguilera, D. Thirde, M. Kampel, M. Borg, G. Fernandez and J. Ferryman. “Visual Surveillance for Airport Monitoring Applications”. In Proc. of 11th Computer Vision Winter Workshop, pp. 5-10. 2006.
[18] G.R. Bradski and A. Kaehler. “Learning OpenCV: Computer Vision with the OpenCV Library”. Ch. 13, p. 506. O’Reilly. 2008.
[19] T.N. Tan, G.D. Sullivan and K.D. Baker. “Structure from Motion using the Ground Plane Constraint”. In ECCV, pp. 277-281. 1992.
[20] A. Mohan, C. Papageorgiou and T. Poggio. “Example-Based Object Detection in Images by Components”. In IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, pp. 349-361. 2001.