2008_A Real-Time Person Detection Method for Moving Cameras

8/8/2019 2008_A Real-Time Person Detection Method for Moving Cameras

1/8

A Real-Time Person Detection Method for Moving

Cameras

Javier Oliver1, Alberto Albiol2, Samuel Morillas1, and Guillermo Peris-Fajarnes1

1 Centro de Investigacion en Tecnologas Graficas (CITG)

[email protected] Dep. de Comunicaciones (DCOM)

Universidad Politecnica de Valencia Spain

Abstract. In this paper, we introduce an advanced real-time method for vision-

based pedestrian detection made up by the sequential combination of two basicmethods applied in a coarse to fine fashion. The proposed method aims to achieve

an improved balance between detection accuracy and computational load by tak-

ing advantage of the strengths of these basic techniques. Boosting techniques in

human detection, which have been demonstrated to provide rapid but not accu-

rate enough results, are used in the first stage to provide a preliminary candidate

selection in the scene. Then, feature extraction and classification methods, which

present high accuracy rates at expenses of a higher computational cost, are ap-

plied over boosting candidates providing the final prediction. Experimental re-

sults show that the proposed method performs effectively and efficiently, which

supports its suitability for real applications.

1 Introduction

The CASBliP Project [1] aims at developing a navigation assistance system for blind

and visually impaired users. This system aids the users by means of the generation

of real-time simplified acoustic maps that describe the scene. These maps consist of

acoustic representations of the main elements that appear in the scene by means of

spatially localized sounds associated to each object. Persons, free paths, moving objectsand other hazards are the main objects in the scene to be included in the acoustic maps.

In this work, we focus on the task of person detection.

Real-time person detection in video sequences is a challenging task since humans

may appear in a great variability of appearances, poses and illumination conditions, and

in variable scenarios such as urban, traffic and cluttered environments. Furthermore, the

problem significantly increases when the acquisition system works onboard a person.

There exists extensive literature on vision-based object detection and in particular on

human detection. The problem has been handled by authors combining different feature

extraction methods with different classification machines for the retrieval process, which This work is supported by CASBLIP project 6-th FP [1]. The authors acknowledge the support

of the Technological Institute of Optics, Colour and Imaging of Valencia - AIDO. Dr. Samuel

Morillas acknowledges the support of Generalitat Valenciana under grant GVPRE/2008/257

and Universitat Politecnica de Valencia under grant Primeros Proyetos de Investigacion

2008/3202.

E. Corchado et al. (Eds.): HAIS 2009, LNAI 5572, pp. 129136, 2009.

c Springer-Verlag Berlin Heidelberg 2009


2/8

130 J. Oliver et al.

may be carried out either using single detection window [4] or a parts-based approach [6].

Haar-like features combined with Boosting classifiers have been commonly used for face

detection by Viola et al [2] [3] yielding excellent results. Similar combination of meth-

ods has been used for human detection by other authors: Oren, Papageorgiou and Poggio

[11] use Haar-like features combined with Boosting, where SVM are used as weak clas-sifiers [15]. Viola et al. [12] extend Haar-Boosting-based methods to handle space-time

information for moving-human detection. During the last years, feature extraction meth-

ods like Histograms of Oriented Gradients (HOG) in combination with SVM classifiers

have been widely used by several authors [4] [6]. Other authors use HOG features com-

bined with Boosting-based classifiers, where SVM are used as weak classifiers in each

stage of the cascade [13]. In this context, previous works on different classification prob-

lems have shown that Boosting-based methods are time-efficient methods [2] whereas,

on the other hand, SVM-based techniques provide more accurate results at the expense

of an increase in computation cost. For our application [1], we need an effective real-timeprocessing method. Therefore, in order to obtain an improved trade-off between detec-

tion accuracy and computational efficiency we propose an advanced pedestrian detection

method which is made up by a coarse-to-fine combination of these two basic approaches.

The rest of the paper is organized as follows: Section 2 gives a description of the

method and the training methodology. Section 3 shows the experimental evaluation.

The main conclusions are summarized in Section 4.

2 System Description

The introduced method, which we name Haar-Boosting-HOG-SVM detection method

(HBHS), performs the pedestrian identification task by dividing the processing chain

into two stages (see Fig. 1). In the first stage, a coarse selection of candidates in the

scene is provided by a Haar-Boosting method, where Haar-like features extracted from

the input image and classified using a Boosting-based method provide a preliminary

candidate selection. In the second stage, single-window HOG descriptors are computed

on each preliminary candidate and then classified using SVM in order to refine the

initial findings. Predictions yielded by SVM are the final labelling of the method.

A set of Boosting classifiers composed by 20 to 45 stages and a set ofPolynomial andGaussian SVM kernels have been considered in order to find out the optimal balance

regarding the computation cost and the accuracy of the HBHS method. MIT-Pedestrian

[17] and INRIA [18] datasets have been used to train the Boosting and the SVM classi-

fication machines.

The HBHS performance of has been obtained from our particular set of test se-

quences used in the CASBliP, that gather different scenarios, positions of the camera,

backgrounds and light conditions.

General description and training methodology for each HBHS module is detailed next.

2.1 Coarse Candidate Selection Module

A fast preliminary coarse candidate selection is based on the classification of the Haar-

like features as in [2]. These features, which are based on differences of mean intensity


3/8

A Real-Time Person Detection Method for Moving Cameras 131

Fig. 1. Flowchart of the HBHS method

among adjacent rectangular groups of pixels, are classified by means of a Boosting

based classifier, which consist of a cascade of weak stage classifiers that have been

trained using the Adaboost [16] method.

The Boosting classifier has been trained using 3365 positive and 200 negative exam-

ples selected from the image databases.

2.2 Fine Labelling Module

This module works on the preliminary set of candidates retrieved by the coarse module

and provides a refined labelling by using SVM for classification of a HOG descriptor of

each preliminary candidate. SVM classification of HOG features provides an accurate

identification but it is also much more computationally demanding. However, since this

refined classification is performed only for the preliminary candidates, the increased

computational load can be assumed within the required real-time processing.

HOG features, which are computed using a simplification of Lowes SIFT algorithm[10], is a vector containing the accumulated histogram of the oriented gradients of the

image. Gradients, which are represented by its magnitude and orientation, are computed

over a dense cell grid, where cells are grouped into overlapped blocks of NBP (Number

of Spatial Bins) cells. The gradient module is normalized by the highest value in the

block and is spatially weighted by a Gaussian function. The gradient orientation infor-

mation is discretized into NBO bins (Number of Orientation Bins). The accumulated

histogram of all blocks shapes the image feature vector.

In the last step of the fine labeling module, HOG feature vectors are classified using

SVM. Several Polynomial and Gaussian kernels [19] have been considered, varying thedegree d of the Polynomial function and the exponential term gamma for the Gaussian,

as defined in [14].

The two imagery datasets have been fully used for the training stage considering the

left-right reflections, obtaining a global dataset with 4991 positive and 16630 negative

examples.

3 Experimental Results

The method has been implemented in C using OpenCV [20] and libsvm [14]libraries.

Training and testing images for Boosting and SVM classification machines were

64x128 pixels size. Scanning images for the final test of the HBHS method were

320x240 pixels. The method performance is measured as the trade-off among time cost

and Accuracy, Precision, Recall, given as


4/8


Accuracy =FP + FN

Ntest,Precision = 100

FPNtest

100,Recall = 100

FNNtest

100

where Ntest is the number of testing dataset, FP is the false positives and FN the false

negatives.

3.1 Coarse Module Performance

A study of the Boosting classifier performance depending on the number of stages has

been carried out using our personal video-sequences collection for the test. Table 1

shows that the results of boosting method by itself depends on the number of stages of

the classifier. We find that for those classifiers with low number of stages the method

provides high number of false alarms, whereas classifiers composed by more stages

provide better results in terms ofPrecision. Regarding the Recall, we find that the high-

est value is obtained for 30 stages with a value of 80.60%. However, for 35 stages weobtain nearly the same Recall and Precision improves considerably. Moreover, the last

row in Table 1 shows that computation cost is less than 50 ms per frame and is held

constant for all the different stages of the Boosting classifier. This low time is due to

the fact that features are extracted from groups of pixels rather than from single pixels.

Besides, Boosting-based classifiers discard most of the background patches in earlier

stages of the cascade, speeding up the process and significantly increasing the detection

rate.

3.2 Fine Module Performance

HOG parameters discussion. There are several parameters in the HOG computation,

such as cell size (measured in pixels), block size (measured in cells), NBO and block

overlapping, that determine the description capability of an image descriptor vector.

First, we analyzed the effect of considering boundary blocks in the image descriptor

computation. Results showed that boundary blocks reduction decreased the SVM accu-

racy in 0.19%. Therefore, such reduction must be considered to speed up the process at

expense of the insignificant decrease in the SVM accuracy.

Second, we analyzed the effect of the block overlapping, considering12 ,

14 and

34

overlapping, values which are proposed by Dalal [4]. The best trade-off between detec-

tion rate and computational cost was obtained for 14

block overlapping, where detection

rate was optimal and time cost per sample, measured in seconds, dropped from 0.020

in the case of 34

overlapping to 0.013 in the 12

and 0.006 in the 14

case.

Furthermore, we analyzed the effect of the block size, being NBP = 4,8. The op-timal values regarding detection rate and time cost were found for NBP = 4, whereaccuracy on test raised from 99.20 to 99.32 and computation cost per sample dropped

from 0.0058 to 0.0052.

SVM kernel selection. For the best SVM kernel selection, we trained and tested two

different kernels: Polynomial and Gaussian. Optimal kernels have been obtained for

the Gaussian kernel, for gamma = [0.050.25]. Kernels with relative higher values ofgamma (g = 0.25) achieve Precision values of 100% but Recall slightly drops and thecomputation cost takes more than 6 ms with respect to the optimal value. On other hand,


5/8


Table 1. Boosting method performance depending on the number of stages of the classifier. Op-

timal values in terms of computational cost and recall are found for 30 and 35 stages.

Nstg 20 Nstg 25 Nstg 30 Nstg 35 Nstg 40 Nstg 45

Prec(%) 10.53 15.29 22.86 34.28 50.07 69.14,Rec(%) 59.94 79.33 80.60 79.86 75.52 72.95

Tcost(ms) 48.8 48.4 49.0 47.3 49.2 48.2

lower values of gamma (g = 0.05) yield the best Recall value, 93.9%, but Precisiondecreases to 99.25%. The best balance among these parameters is obtained for g = 0.2,where Precision = 99.93%, Recall = 93.70%, Accuracy = 99.6% and the runtime is11ms. It is important to remark that we prefer higher Precision values rather than higher

Recall values since SVM will work in the second stage of the process and will have toyield low number of false positive images. From here on we will use radial basis kernel

with gamma = 0.25.

3.3 HBHS Performance

In this section we test the performance of the HBHS method. Several stages have

been studied for the boosting classifier. HOG parameters have been set to: NBP = 4,NBO = 4{0180}, block overlapping = 1/4, feature vector= 512 components for a

64x128 window. A Radial Basis kernel with gamma = 0.2 has been used for the SVMclassifier, and the optimal threshold in the classification has been studied. The following

figure shows the method performances combining different boosting classifiers, varying

the number of stages, and studying the effect of the detection threshold in the SVM con-

tribution. Figure 2 shows the relationship between Precision and Recall of the HBHS

for different boosting classifiers.

Boosting results involve a constraint in the maximum performance achievable in

the Recall of the HBHS method. On the other hand, Precision is highly improved with

SVM contribution, achieving values of 100% ofPrecision for Recall values under 50%.

For Recall values higher than 50%, the Precision starts to drop off. The best balancebetween Precision and Recall is obtained for the 35-stage classifier with a threshold

near -0.5.

A study of the computation cost of each of these methods is necessary to complete

the study. Table 2 shows mean values of the HBHS time cost for the same scenario.

Row 1 shows the number of candidates per image selected by the boosting method and

Rows 2-4 show detailed averaged time information of each processing module during

the scanning of the 320x240 input images. It is easy to find that SVM method is the bot-

tle neck of the process in terms of computational cost. Therefore, the lower number of

candidates provided by Boosting positively affects the SVM computational cost. Notethat for lower number of stages, more candidates are yielded by boosting and therefore

higher computational cost is needed by SVM. However, for those situations with more

stages in Boosting classifier, the necessary time for SVM significantly decreases. See

that there is a significant step in the final temporal cost for a boosting classifier with

30 and 35 stages. The total cost decreases in 50 ms for 35 stages, which is more than


6/8


Fig. 2. Comparison among several HBHS classifiers varying the number of stages in Boosting

and the SVM classification threshold. The SVM has been trained with a Gaussian kernel with

gamma = 0.175. In coloured lines, HBHS performance. In black, N. Dalal performance.

Table 2. HBHS temporal analysis for different number of Boosting stages and a Gaussian kernel

with gamma = 0.175. Mean values per window.

Nstg 20 Nstg 25 Nstg 30 Nstg 35 Nstg 40 Nstg 45

Boosting candidates 8.84 4.80 3.24 2.27 1.97 1.64

Boosting time (ms) 48.8 48.4 49.0 47.3 49.2 48.2

HOG time (ms) 141.6 78.1 52.8 36.4 31.9 25.8

SVM time (ms) 314.9 169.7 116.8 80.4 71.4 57.9

HBHS time (ms) 505.5 296.3 218.7 164.2 152.6 131.9

25%. However, a classifier based on 40 stages does insignificantly reduce the cost incomparison to the 35 stages. According to the figure 2, the best results were found for

30, 35 and 40 stages. On the other hand, results on time study reveal that a classifier

based on 35 stages is preferred rather than the 30 one. Although classifiers with 40

and 45 stages provide good temporal results, Precision and Recall drops considerably.

Therefore, the best combination ofAccuracy and Temporal Cost is reached for a boost-

ing classifier with 35 stages and a threshold set to -0.4. Figure 3 shows some examples

of the HBHS method performances over our collected dataset for different stages of

boosting method. Boosting candidates are surrounded by white boxes. SVM contribu-

tion and therefore the final prediction is represented in grey colour. The visual analysisreveals that sometimes the positive boosting candidates are not well framed. Then, the

feature extraction is carried out over an image where the candidate is not centred in the

window. SVM is more sensitive to changes than boosting, and therefore a slight drift

in the position implies a wrong classification. The problem increases for high stages

of boosting classifiers. For that reason, there are several occasions where SVM fails

in the labelling. However, the final labelling is accurate and fast enough for real-time

processing of complex scenarios.

3.4 HBHS against Other Methods

In this subsection we present a comparison of our method with two reference methods,

the one based on Viola and Jones [2] and the other based on Dalal [4]. For the first

method, we have developed our own implementation according to their theoretical ba-

sis. For Dalals method, we have straight downloaded the source code from his official


7/8


Fig. 3. HBHS performance over our particular imagery collection, for different stages of the

coarse classifier. From left to right, classifiers comprise from 25 to 45 stages. In blue, discarded

coarse predictions. In green, SVM candidates and therefore the final prediction of HBHS.

Table 3. Comparison of HBHS among other methods

Precison Recall Time(ms)

Viola and Jones 34.28 79.86 47Dalal 85.69 72.61 1990

HBHS 79.36 72.77 164

website [5] and compiled it under linux using gcc. Testing has been carried out using

our personal imagery dataset.

Table 3 shows a comparison among Viola and Jones approach, Dalal person detector

and HBHS. This table shows that our method presents a good balance among time cost,

Precision and Recall, achieving high temporal performance providing high Precisionand Recall rates as well. The best rates of Precision and Recall are achieved by Dalal,

but temporal requirements are significant, too. On the other hand, Viola and Jones ap-

proach needs the lowest computational cost but poorer detection rates are achieved.

HBHS, however, provides similar detection rates to Dalal and still presents low com-

putational cost, 10 times lower than Dalals method requirements, achieving real-time

detection at a high detection rates.

4 Conclusions and Future Work

In this paper we present a new real-time method to detect persons in image sequences

of diverse nature. This method has been devised with the aim of improving the balance

detection accuracy and computational cost of the state-of-the-art methods. The pro-

posed method comprised the sequential combination of two basic methods performing

in a coarse-to-fine fashion. A fast coarse preliminary candidate selection is provided by

boosting techniques. Then, SVM-based classification of HOG features extracted from

the preliminary candidates refines the selection and provides the final detection. Exten-

sive experimental results using real sequences have been carried out both for parameteradjustment and performance assessment. It can be seen that the proposed method ob-

tains an improved trade-off between computational cost and detection accuracy.

Several modifications that may enhance the overall performance of the method might

be the following: First, tracking techniques may be considered in order to improve the

high miss rate and the slightly deviations of the candidate bounding box that Adaboost


8/8


produces; Second, other implementations of SVM may be used to reduce the high com-

putational cost of the fine classification. Third, stereo-cameras may be used in order

to perform the preliminary candidate selection and intelligent scan considering the ex-

pected size of a person at each depth.

References

1. CASBLIP Project (FP6-2004-IST-4),http://www.casblip.upv.es

2. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. IEEE

Computer Vision and Pattern Recognition 25, 29 (2001)

3. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by com-

ponents. IEEE Trans. Pattern Anal. Mach. Intell. 23(4), 349361 (2001)

4. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: Proceed-

ings of IEEE Conference Computer Vision and Pattern Recognition, San Diego, USA, pp.886893 (June 2005)

5. Dalal, N.: Pedestrian Detector source code,

http://www.navneetdalal.com/software

6. Parra, I., Fernandez, D., Sotelo, M.A., Bergasa, L.M., Revenga, P., Ocana, M., Garcia, M.A.:

Combination of Feature Extraction Methods for SVM Pedestrian Detection. IEEE Trans.

Intell. Trans. Sys. 8(2) (June 2007)

7. Gavrila, D., Philomin, V.: Real-time object detection for smart vehicles. In: Proc IEEE Int.

Conf. Comput. Vis, pp. 8793 (1999)

8. Broggi, A., Bertozzi, M., Fascioli, A., Sechi, M.: Shape-based pedestrian detection. In: Proc.

IEEE Intell. Veh. Symp. Dearborn, MI, pp. 215220 (October 2000)9. Bertozzi, M., Broggi, A., Chapuis, R., Chausse, F., Fascioli, A., Tibaldi, A.: Shape-based

pedestrian detection and localization. In: Proc IEEE ITS Conf., Shanghai, China, pp. 328

333 (October 2003)

10. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of

Computer Vision 60(2), 91110 (2004)

11. Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., Poggio, T.: Pedestrian Detection Using

Wavelet Templates. Computer Vision Pattern Recognition (1997)

12. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appear-

ance. In: International Conference on Computer Vision (2003)

13. Carmen, A., Albiol, A.: Authomatic Pedestrian Detection using boosting techniques. Final

Degree Project. Technical University of Valencia (2006)

14. An implementation of Support Vector Machines (SVM) in C,

http://svmlight.joachims.org/

15. Yuan, Z., Yang, L., Qu, Y., Liu, Y., Jia, X.: A Boosting SVM Chain Learning for Visual

Information Retrieval. In: Wang, J., Yi, Z., Zurada, J.M., Lu, B.-L., Yin, H. (eds.) ISNN

2006. LNCS, vol. 3971, pp. 10631069. Springer, Heidelberg (2006)

16. Freund, Y., Schapire, R.E.: A Short Introduction to Boosting. Journal of Japanese Society for

Artificial Intelligense 14(5), 771780 (1999)

17. Pedestrian Dataset from MIT,http://cbcl.mit.edu/software-datasets/PedestrianData.html

18. Pedestrian Dataset from INRIA,http://pascal.inrialpes.fr/data/human/

19. Vapnik, V.: The Nature of Statistical Learning Theory, p. 314. Springer, Heidelberg (2000)

20. OpenCV library, http://sourceforge.net/projects/opencvlibrary/
http://www.casblip.upv.es/http://www.navneetdalal.com/softwarehttp://svmlight.joachims.org/http://cbcl.mit.edu/software-datasets/PedestrianData.htmlhttp://pascal.inrialpes.fr/data/human/http://sourceforge.net/projects/opencvlibrary/http://sourceforge.net/projects/opencvlibrary/http://pascal.inrialpes.fr/data/human/http://cbcl.mit.edu/software-datasets/PedestrianData.htmlhttp://svmlight.joachims.org/http://www.navneetdalal.com/softwarehttp://www.casblip.upv.es/

2008_A Real-Time Person Detection Method for Moving Cameras

Documents

Transcript of 2008_A Real-Time Person Detection Method for Moving Cameras