2008_A Real-Time Person Detection Method for Moving Cameras

download 2008_A Real-Time Person Detection Method for Moving Cameras

of 8

Transcript of 2008_A Real-Time Person Detection Method for Moving Cameras

  • 8/8/2019 2008_A Real-Time Person Detection Method for Moving Cameras

    1/8

    A Real-Time Person Detection Method for Moving

    Cameras

    Javier Oliver1, Alberto Albiol2, Samuel Morillas1, and Guillermo Peris-Fajarnes1

    1 Centro de Investigacion en Tecnologas Graficas (CITG)

    [email protected] Dep. de Comunicaciones (DCOM)

    Universidad Politecnica de Valencia Spain

    Abstract. In this paper, we introduce an advanced real-time method for vision-

    based pedestrian detection made up by the sequential combination of two basicmethods applied in a coarse to fine fashion. The proposed method aims to achieve

    an improved balance between detection accuracy and computational load by tak-

    ing advantage of the strengths of these basic techniques. Boosting techniques in

    human detection, which have been demonstrated to provide rapid but not accu-

    rate enough results, are used in the first stage to provide a preliminary candidate

    selection in the scene. Then, feature extraction and classification methods, which

    present high accuracy rates at expenses of a higher computational cost, are ap-

    plied over boosting candidates providing the final prediction. Experimental re-

    sults show that the proposed method performs effectively and efficiently, which

    supports its suitability for real applications.

    1 Introduction

    The CASBliP Project [1] aims at developing a navigation assistance system for blind

    and visually impaired users. This system aids the users by means of the generation

    of real-time simplified acoustic maps that describe the scene. These maps consist of

    acoustic representations of the main elements that appear in the scene by means of

    spatially localized sounds associated to each object. Persons, free paths, moving objectsand other hazards are the main objects in the scene to be included in the acoustic maps.

    In this work, we focus on the task of person detection.

    Real-time person detection in video sequences is a challenging task since humans

    may appear in a great variability of appearances, poses and illumination conditions, and

    in variable scenarios such as urban, traffic and cluttered environments. Furthermore, the

    problem significantly increases when the acquisition system works onboard a person.

    There exists extensive literature on vision-based object detection and in particular on

    human detection. The problem has been handled by authors combining different feature

    extraction methods with different classification machines for the retrieval process, which This work is supported by CASBLIP project 6-th FP [1]. The authors acknowledge the support

    of the Technological Institute of Optics, Colour and Imaging of Valencia - AIDO. Dr. Samuel

    Morillas acknowledges the support of Generalitat Valenciana under grant GVPRE/2008/257

    and Universitat Politecnica de Valencia under grant Primeros Proyetos de Investigacion

    2008/3202.

    E. Corchado et al. (Eds.): HAIS 2009, LNAI 5572, pp. 129136, 2009.

    c Springer-Verlag Berlin Heidelberg 2009

  • 8/8/2019 2008_A Real-Time Person Detection Method for Moving Cameras

    2/8

    130 J. Oliver et al.

    may be carried out either using single detection window [4] or a parts-based approach [6].

    Haar-like features combined with Boosting classifiers have been commonly used for face

    detection by Viola et al [2] [3] yielding excellent results. Similar combination of meth-

    ods has been used for human detection by other authors: Oren, Papageorgiou and Poggio

    [11] use Haar-like features combined with Boosting, where SVM are used as weak clas-sifiers [15]. Viola et al. [12] extend Haar-Boosting-based methods to handle space-time

    information for moving-human detection. During the last years, feature extraction meth-

    ods like Histograms of Oriented Gradients (HOG) in combination with SVM classifiers

    have been widely used by several authors [4] [6]. Other authors use HOG features com-

    bined with Boosting-based classifiers, where SVM are used as weak classifiers in each

    stage of the cascade [13]. In this context, previous works on different classification prob-

    lems have shown that Boosting-based methods are time-efficient methods [2] whereas,

    on the other hand, SVM-based techniques provide more accurate results at the expense

    of an increase in computation cost. For our application [1], we need an effective real-timeprocessing method. Therefore, in order to obtain an improved trade-off between detec-

    tion accuracy and computational efficiency we propose an advanced pedestrian detection

    method which is made up by a coarse-to-fine combination of these two basic approaches.

    The rest of the paper is organized as follows: Section 2 gives a description of the

    method and the training methodology. Section 3 shows the experimental evaluation.

    The main conclusions are summarized in Section 4.

    2 System Description

    The introduced method, which we name Haar-Boosting-HOG-SVM detection method

    (HBHS), performs the pedestrian identification task by dividing the processing chain

    into two stages (see Fig. 1). In the first stage, a coarse selection of candidates in the

    scene is provided by a Haar-Boosting method, where Haar-like features extracted from

    the input image and classified using a Boosting-based method provide a preliminary

    candidate selection. In the second stage, single-window HOG descriptors are computed

    on each preliminary candidate and then classified using SVM in order to refine the

    initial findings. Predictions yielded by SVM are the final labelling of the method.

    A set of Boosting classifiers composed by 20 to 45 stages and a set ofPolynomial andGaussian SVM kernels have been considered in order to find out the optimal balance

    regarding the computation cost and the accuracy of the HBHS method. MIT-Pedestrian

    [17] and INRIA [18] datasets have been used to train the Boosting and the SVM classi-

    fication machines.

    The HBHS performance of has been obtained from our particular set of test se-

    quences used in the CASBliP, that gather different scenarios, positions of the camera,

    backgrounds and light conditions.

    General description and training methodology for each HBHS module is detailed next.

    2.1 Coarse Candidate Selection Module

    A fast preliminary coarse candidate selection is based on the classification of the Haar-

    like features as in [2]. These features, which are based on differences of mean intensity

  • 8/8/2019 2008_A Real-Time Person Detection Method for Moving Cameras

    3/8

    A Real-Time Person Detection Method for Moving Cameras 131

    Fig. 1. Flowchart of the HBHS method

    among adjacent rectangular groups of pixels, are classified by means of a Boosting

    based classifier, which consist of a cascade of weak stage classifiers that have been

    trained using the Adaboost [16] method.

    The Boosting classifier has been trained using 3365 positive and 200 negative exam-

    ples selected from the image databases.

    2.2 Fine Labelling Module

    This module works on the preliminary set of candidates retrieved by the coarse module

    and provides a refined labelling by using SVM for classification of a HOG descriptor of

    each preliminary candidate. SVM classification of HOG features provides an accurate

    identification but it is also much more computationally demanding. However, since this

    refined classification is performed only for the preliminary candidates, the increased

    computational load can be assumed within the required real-time processing.

    HOG features, which are computed using a simplification of Lowes SIFT algorithm[10], is a vector containing the accumulated histogram of the oriented gradients of the

    image. Gradients, which are represented by its magnitude and orientation, are computed

    over a dense cell grid, where cells are grouped into overlapped blocks of NBP (Number

    of Spatial Bins) cells. The gradient module is normalized by the highest value in the

    block and is spatially weighted by a Gaussian function. The gradient orientation infor-

    mation is discretized into NBO bins (Number of Orientation Bins). The accumulated

    histogram of all blocks shapes the image feature vector.

    In the last step of the fine labeling module, HOG feature vectors are classified using

    SVM. Several Polynomial and Gaussian kernels [19] have been considered, varying thedegree d of the Polynomial function and the exponential term gamma for the Gaussian,

    as defined in [14].

    The two imagery datasets have been fully used for the training stage considering the

    left-right reflections, obtaining a global dataset with 4991 positive and 16630 negative

    examples.

    3 Experimental Results

    The method has been implemented in C using OpenCV [20] and libsvm [14]libraries.

    Training and testing images for Boosting and SVM classification machines were

    64x128 pixels size. Scanning images for the final test of the HBHS method were

    320x240 pixels. The method performance is measured as the trade-off among time cost

    and Accuracy, Precision, Recall, given as

  • 8/8/2019 2008_A Real-Time Person Detection Method for Moving Cameras

    4/8

    132 J. Oliver et al.

    Accuracy =FP + FN

    Ntest,Precision = 100

    FPNtest

    100,Recall = 100

    FNNtest

    100

    where Ntest is the number of testing dataset, FP is the false positives and FN the false

    negatives.

    3.1 Coarse Module Performance

    A study of the Boosting classifier performance depending on the number of stages has

    been carried out using our personal video-sequences collection for the test. Table 1

    shows that the results of boosting method by itself depends on the number of stages of

    the classifier. We find that for those classifiers with low number of stages the method

    provides high number of false alarms, whereas classifiers composed by more stages

    provide better results in terms ofPrecision. Regarding the Recall, we find that the high-

    est value is obtained for 30 stages with a value of 80.60%. However, for 35 stages weobtain nearly the same Recall and Precision improves considerably. Moreover, the last

    row in Table 1 shows that computation cost is less than 50 ms per frame and is held

    constant for all the different stages of the Boosting classifier. This low time is due to

    the fact that features are extracted from groups of pixels rather than from single pixels.

    Besides, Boosting-based classifiers discard most of the background patches in earlier

    stages of the cascade, speeding up the process and significantly increasing the detection

    rate.

    3.2 Fine Module Performance

    HOG parameters discussion. There are several parameters in the HOG computation,

    such as cell size (measured in pixels), block size (measured in cells), NBO and block

    overlapping, that determine the description capability of an image descriptor vector.

    First, we analyzed the effect of considering boundary blocks in the image descriptor

    computation. Results showed that boundary blocks reduction decreased the SVM accu-

    racy in 0.19%. Therefore, such reduction must be considered to speed up the process at

    expense of the insignificant decrease in the SVM accuracy.

    Second, we analyzed the effect of the block overlapping, considering12 ,

    14 and

    34

    overlapping, values which are proposed by Dalal [4]. The best trade-off between detec-

    tion rate and computational cost was obtained for 14

    block overlapping, where detection

    rate was optimal and time cost per sample, measured in seconds, dropped from 0.020

    in the case of 34

    overlapping to 0.013 in the 12

    and 0.006 in the 14

    case.

    Furthermore, we analyzed the effect of the block size, being NBP = 4,8. The op-timal values regarding detection rate and time cost were found for NBP = 4, whereaccuracy on test raised from 99.20 to 99.32 and computation cost per sample dropped

    from 0.0058 to 0.0052.

    SVM kernel selection. For the best SVM kernel selection, we trained and tested two

    different kernels: Polynomial and Gaussian. Optimal kernels have been obtained for

    the Gaussian kernel, for gamma = [0.050.25]. Kernels with relative higher values ofgamma (g = 0.25) achieve Precision values of 100% but Recall slightly drops and thecomputation cost takes more than 6 ms with respect to the optimal value. On other hand,

  • 8/8/2019 2008_A Real-Time Person Detection Method for Moving Cameras

    5/8

    A Real-Time Person Detection Method for Moving Cameras 133

    Table 1. Boosting method performance depending on the number of stages of the classifier. Op-

    timal values in terms of computational cost and recall are found for 30 and 35 stages.

    Nstg 20 Nstg 25 Nstg 30 Nstg 35 Nstg 40 Nstg 45

    Prec(%) 10.53 15.29 22.86 34.28 50.07 69.14,Rec(%) 59.94 79.33 80.60 79.86 75.52 72.95

    Tcost(ms) 48.8 48.4 49.0 47.3 49.2 48.2

    lower values of gamma (g = 0.05) yield the best Recall value, 93.9%, but Precisiondecreases to 99.25%. The best balance among these parameters is obtained for g = 0.2,where Precision = 99.93%, Recall = 93.70%, Accuracy = 99.6% and the runtime is11ms. It is important to remark that we prefer higher Precision values rather than higher

    Recall values since SVM will work in the second stage of the process and will have toyield low number of false positive images. From here on we will use radial basis kernel

    with gamma = 0.25.

    3.3 HBHS Performance

    In this section we test the performance of the HBHS method. Several stages have

    been studied for the boosting classifier. HOG parameters have been set to: NBP = 4,NBO = 4{0180}, block overlapping = 1/4, feature vector= 512 components for a

    64x128 window. A Radial Basis kernel with gamma = 0.2 has been used for the SVMclassifier, and the optimal threshold in the classification has been studied. The following

    figure shows the method performances combining different boosting classifiers, varying

    the number of stages, and studying the effect of the detection threshold in the SVM con-

    tribution. Figure 2 shows the relationship between Precision and Recall of the HBHS

    for different boosting classifiers.

    Boosting results involve a constraint in the maximum performance achievable in

    the Recall of the HBHS method. On the other hand, Precision is highly improved with

    SVM contribution, achieving values of 100% ofPrecision for Recall values under 50%.

    For Recall values higher than 50%, the Precision starts to drop off. The best balancebetween Precision and Recall is obtained for the 35-stage classifier with a threshold

    near -0.5.

    A study of the computation cost of each of these methods is necessary to complete

    the study. Table 2 shows mean values of the HBHS time cost for the same scenario.

    Row 1 shows the number of candidates per image selected by the boosting method and

    Rows 2-4 show detailed averaged time information of each processing module during

    the scanning of the 320x240 input images. It is easy to find that SVM method is the bot-

    tle neck of the process in terms of computational cost. Therefore, the lower number of

    candidates provided by Boosting positively affects the SVM computational cost. Notethat for lower number of stages, more candidates are yielded by boosting and therefore

    higher computational cost is needed by SVM. However, for those situations with more

    stages in Boosting classifier, the necessary time for SVM significantly decreases. See

    that there is a significant step in the final temporal cost for a boosting classifier with

    30 and 35 stages. The total cost decreases in 50 ms for 35 stages, which is more than

  • 8/8/2019 2008_A Real-Time Person Detection Method for Moving Cameras

    6/8

    134 J. Oliver et al.

    Fig. 2. Comparison among several HBHS classifiers varying the number of stages in Boosting

    and the SVM classification threshold. The SVM has been trained with a Gaussian kernel with

    gamma = 0.175. In coloured lines, HBHS performance. In black, N. Dalal performance.

    Table 2. HBHS temporal analysis for different number of Boosting stages and a Gaussian kernel

    with gamma = 0.175. Mean values per window.

    Nstg 20 Nstg 25 Nstg 30 Nstg 35 Nstg 40 Nstg 45

    Boosting candidates 8.84 4.80 3.24 2.27 1.97 1.64

    Boosting time (ms) 48.8 48.4 49.0 47.3 49.2 48.2

    HOG time (ms) 141.6 78.1 52.8 36.4 31.9 25.8

    SVM time (ms) 314.9 169.7 116.8 80.4 71.4 57.9

    HBHS time (ms) 505.5 296.3 218.7 164.2 152.6 131.9

    25%. However, a classifier based on 40 stages does insignificantly reduce the cost incomparison to the 35 stages. According to the figure 2, the best results were found for

    30, 35 and 40 stages. On the other hand, results on time study reveal that a classifier

    based on 35 stages is preferred rather than the 30 one. Although classifiers with 40

    and 45 stages provide good temporal results, Precision and Recall drops considerably.

    Therefore, the best combination ofAccuracy and Temporal Cost is reached for a boost-

    ing classifier with 35 stages and a threshold set to -0.4. Figure 3 shows some examples

    of the HBHS method performances over our collected dataset for different stages of

    boosting method. Boosting candidates are surrounded by white boxes. SVM contribu-

    tion and therefore the final prediction is represented in grey colour. The visual analysisreveals that sometimes the positive boosting candidates are not well framed. Then, the

    feature extraction is carried out over an image where the candidate is not centred in the

    window. SVM is more sensitive to changes than boosting, and therefore a slight drift

    in the position implies a wrong classification. The problem increases for high stages

    of boosting classifiers. For that reason, there are several occasions where SVM fails

    in the labelling. However, the final labelling is accurate and fast enough for real-time

    processing of complex scenarios.

    3.4 HBHS against Other Methods

    In this subsection we present a comparison of our method with two reference methods,

    the one based on Viola and Jones [2] and the other based on Dalal [4]. For the first

    method, we have developed our own implementation according to their theoretical ba-

    sis. For Dalals method, we have straight downloaded the source code from his official

  • 8/8/2019 2008_A Real-Time Person Detection Method for Moving Cameras

    7/8

    A Real-Time Person Detection Method for Moving Cameras 135

    Fig. 3. HBHS performance over our particular imagery collection, for different stages of the

    coarse classifier. From left to right, classifiers comprise from 25 to 45 stages. In blue, discarded

    coarse predictions. In green, SVM candidates and therefore the final prediction of HBHS.

    Table 3. Comparison of HBHS among other methods

    Precison Recall Time(ms)

    Viola and Jones 34.28 79.86 47Dalal 85.69 72.61 1990

    HBHS 79.36 72.77 164

    website [5] and compiled it under linux using gcc. Testing has been carried out using

    our personal imagery dataset.

    Table 3 shows a comparison among Viola and Jones approach, Dalal person detector

    and HBHS. This table shows that our method presents a good balance among time cost,

    Precision and Recall, achieving high temporal performance providing high Precisionand Recall rates as well. The best rates of Precision and Recall are achieved by Dalal,

    but temporal requirements are significant, too. On the other hand, Viola and Jones ap-

    proach needs the lowest computational cost but poorer detection rates are achieved.

    HBHS, however, provides similar detection rates to Dalal and still presents low com-

    putational cost, 10 times lower than Dalals method requirements, achieving real-time

    detection at a high detection rates.

    4 Conclusions and Future Work

    In this paper we present a new real-time method to detect persons in image sequences

    of diverse nature. This method has been devised with the aim of improving the balance

    detection accuracy and computational cost of the state-of-the-art methods. The pro-

    posed method comprised the sequential combination of two basic methods performing

    in a coarse-to-fine fashion. A fast coarse preliminary candidate selection is provided by

    boosting techniques. Then, SVM-based classification of HOG features extracted from

    the preliminary candidates refines the selection and provides the final detection. Exten-

    sive experimental results using real sequences have been carried out both for parameteradjustment and performance assessment. It can be seen that the proposed method ob-

    tains an improved trade-off between computational cost and detection accuracy.

    Several modifications that may enhance the overall performance of the method might

    be the following: First, tracking techniques may be considered in order to improve the

    high miss rate and the slightly deviations of the candidate bounding box that Adaboost

  • 8/8/2019 2008_A Real-Time Person Detection Method for Moving Cameras

    8/8

    136 J. Oliver et al.

    produces; Second, other implementations of SVM may be used to reduce the high com-

    putational cost of the fine classification. Third, stereo-cameras may be used in order

    to perform the preliminary candidate selection and intelligent scan considering the ex-

    pected size of a person at each depth.

    References

    1. CASBLIP Project (FP6-2004-IST-4),http://www.casblip.upv.es

    2. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. IEEE

    Computer Vision and Pattern Recognition 25, 29 (2001)

    3. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by com-

    ponents. IEEE Trans. Pattern Anal. Mach. Intell. 23(4), 349361 (2001)

    4. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: Proceed-

    ings of IEEE Conference Computer Vision and Pattern Recognition, San Diego, USA, pp.886893 (June 2005)

    5. Dalal, N.: Pedestrian Detector source code,

    http://www.navneetdalal.com/software

    6. Parra, I., Fernandez, D., Sotelo, M.A., Bergasa, L.M., Revenga, P., Ocana, M., Garcia, M.A.:

    Combination of Feature Extraction Methods for SVM Pedestrian Detection. IEEE Trans.

    Intell. Trans. Sys. 8(2) (June 2007)

    7. Gavrila, D., Philomin, V.: Real-time object detection for smart vehicles. In: Proc IEEE Int.

    Conf. Comput. Vis, pp. 8793 (1999)

    8. Broggi, A., Bertozzi, M., Fascioli, A., Sechi, M.: Shape-based pedestrian detection. In: Proc.

    IEEE Intell. Veh. Symp. Dearborn, MI, pp. 215220 (October 2000)9. Bertozzi, M., Broggi, A., Chapuis, R., Chausse, F., Fascioli, A., Tibaldi, A.: Shape-based

    pedestrian detection and localization. In: Proc IEEE ITS Conf., Shanghai, China, pp. 328

    333 (October 2003)

    10. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of

    Computer Vision 60(2), 91110 (2004)

    11. Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., Poggio, T.: Pedestrian Detection Using

    Wavelet Templates. Computer Vision Pattern Recognition (1997)

    12. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appear-

    ance. In: International Conference on Computer Vision (2003)

    13. Carmen, A., Albiol, A.: Authomatic Pedestrian Detection using boosting techniques. Final

    Degree Project. Technical University of Valencia (2006)

    14. An implementation of Support Vector Machines (SVM) in C,

    http://svmlight.joachims.org/

    15. Yuan, Z., Yang, L., Qu, Y., Liu, Y., Jia, X.: A Boosting SVM Chain Learning for Visual

    Information Retrieval. In: Wang, J., Yi, Z., Zurada, J.M., Lu, B.-L., Yin, H. (eds.) ISNN

    2006. LNCS, vol. 3971, pp. 10631069. Springer, Heidelberg (2006)

    16. Freund, Y., Schapire, R.E.: A Short Introduction to Boosting. Journal of Japanese Society for

    Artificial Intelligense 14(5), 771780 (1999)

    17. Pedestrian Dataset from MIT,http://cbcl.mit.edu/software-datasets/PedestrianData.html

    18. Pedestrian Dataset from INRIA,http://pascal.inrialpes.fr/data/human/

    19. Vapnik, V.: The Nature of Statistical Learning Theory, p. 314. Springer, Heidelberg (2000)

    20. OpenCV library, http://sourceforge.net/projects/opencvlibrary/

    http://www.casblip.upv.es/http://www.navneetdalal.com/softwarehttp://svmlight.joachims.org/http://cbcl.mit.edu/software-datasets/PedestrianData.htmlhttp://pascal.inrialpes.fr/data/human/http://sourceforge.net/projects/opencvlibrary/http://sourceforge.net/projects/opencvlibrary/http://pascal.inrialpes.fr/data/human/http://cbcl.mit.edu/software-datasets/PedestrianData.htmlhttp://svmlight.joachims.org/http://www.navneetdalal.com/softwarehttp://www.casblip.upv.es/