DPM Slides


Description: deformable parts detection slides

Transcript of DPM Slides

  • Deformable part models
    Ross Girshick, UC Berkeley

    CS231B, Stanford University, guest lecture, April 16, 2013

  • Image understanding

    photo by thomas pix http://www.flickr.com/photos/thomaspix/2591427106

    Snack time in the lab

  • What objects are where?


    I see twinkies!

    robot: I see a table with twinkies, pretzels, fruit, and some mysterious chocolate things...

  • DPM lecture overview

    (a) (b) (c) (d) (e) (f) (g)
    Figure 6. Our HOG detectors cue mainly on silhouette contours (especially the head, shoulders and feet). The most active blocks are centred on the image background just outside the contour. (a) The average gradient image over the training examples. (b) Each pixel shows the maximum positive SVM weight in the block centred on the pixel. (c) Likewise for the negative SVM weights. (d) A test image. (e) Its computed R-HOG descriptor. (f,g) The R-HOG descriptor weighted by respectively the positive and the negative SVM weights.

    would help to improve the detection results in more general situations.
    Acknowledgments. This work was supported by the European Union research projects ACEMEDIA and PASCAL. We thank Cordelia Schmid for many useful comments. SVM-Light [10] provided reliable training of large-scale SVMs.

    References

    [1] S. Belongie, J. Malik, and J. Puzicha. Matching shapes. The 8th ICCV, Vancouver, Canada, pages 454-461, 2001.

    [2] V. de Poortere, J. Cant, B. Van den Bosch, J. de Prins, F. Fransens, and L. Van Gool. Efficient pedestrian detection: a test case for SVM based categorization. Workshop on Cognitive Vision, 2002. Available online: http://www.vision.ethz.ch/cogvis02/.

    [3] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. CVPR, Hilton Head Island, South Carolina, USA, pages 66-75, 2000.

    [4] W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. Intl. Workshop on Automatic Face- and Gesture-Recognition, IEEE Computer Society, Zurich, Switzerland, pages 296-301, June 1995.

    [5] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer vision for computer games. 2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, pages 100-105, October 1996.

    [6] D. M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82-98, 1999.

    [7] D. M. Gavrila, J. Giebel, and S. Munder. Vision-based pedestrian detection: the PROTECTOR+ system. Proc. of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 2004.

    [8] D. M. Gavrila and V. Philomin. Real-time object detection for smart vehicles. CVPR, Fort Collins, Colorado, USA, pages 87-93, 1999.

    [9] S. Ioffe and D. A. Forsyth. Probabilistic methods for finding people. IJCV, 43(1):45-68, 2001.

    [10] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. The MIT Press, Cambridge, MA, USA, 1999.

    [11] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. CVPR, Washington, DC, USA, pages 66-75, 2004.

    [12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.

    [13] R. K. McConnell. Method of and apparatus for pattern recognition, January 1986. U.S. Patent No. 4,567,610.

    [14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 2004. Accepted.

    [15] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 60(1):63-86, 2004.

    [16] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. The 8th ECCV, Prague, Czech Republic, volume I, pages 69-81, 2004.

    [17] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. PAMI, 23(4):349-361, April 2001.

    [18] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 38(1):15-33, 2000.

    [19] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. The 7th ECCV, Copenhagen, Denmark, volume IV, pages 700-714, 2002.

    [20] H. Schneiderman and T. Kanade. Object detection using the statistics of parts. IJCV, 56(3):151-177, 2004.

    [21] E. L. Schwartz. Spatial mapping in the primate sensory projection: analytic structure and relevance to perception. Biological Cybernetics, 25(4):181-194, 1977.

    [22] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. The 9th ICCV, Nice, France, volume 1, pages 734-741, 2003.

    A Discriminatively Trained, Multiscale, Deformable Part Model

    Pedro Felzenszwalb, University of Chicago, [email protected]

    David McAllester, Toyota Technological Institute at Chicago, [email protected]

    Deva Ramanan, UC Irvine, [email protected]

    Abstract

    This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

    1. Introduction

    We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.

    Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.

    This material is based upon work supported by the National Science Foundation under Grants No. 0534820 and 0535174.

    Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.

    The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1, 3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by conceptually weaker models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

    Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

    Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining hard negative examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a


    AP by year: 2005: 12%, 2008: 27%, 2009: 36%, 2010: 45%, 2011: 49%

    Part 1: modeling

    Part 2: learning

  • Formalizing the object detection task

    Many possible ways; this one is popular:

    Input: an image and the set of object classes of interest (cat, dog, chair, cow, person, motorbike, car, ...)

    Desired output: a bounding box around each object instance, labeled with its class (person, motorbike, ...)

    Performance summary: Average Precision (AP); 0 is worst, 1 is perfect
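The AP metric named on this slide can be sketched in a few lines. This is a hedged illustration of step-wise area under the precision-recall curve, not the official PASCAL VOC evaluation code; `average_precision` and its inputs are hypothetical names, and each detection is assumed to be pre-matched to a 1 (true positive) or 0 (false positive) label.

```python
import numpy as np

def average_precision(scores, labels):
    """Area under the precision-recall curve, accumulated step-wise.
    scores: detector confidences for each detection.
    labels: 1 if the detection matched a ground-truth object, else 0."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)            # true positives so far, in score order
    fp = np.cumsum(1 - labels)        # false positives so far
    recall = tp / labels.sum()
    precision = tp / (tp + fp)
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)   # rectangle under the curve step
        prev_recall = r
    return float(ap)

# perfect ranking (both positives above the negative) -> AP = 1.0
print(average_precision([0.9, 0.8, 0.1], [1, 1, 0]))  # -> 1.0
```

Ranking a false positive above a true positive lowers the precision at that recall step, which is why AP rewards well-ordered detector scores and not just correct classifications.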

  • Benchmark datasets

    PASCAL VOC 2005-2012
    - 54k objects in 22k images
    - 20 object classes
    - annual competition


  • Reduction to binary classification

    Figure 2. Some sample images from our new human detection database. The subjects are always upright, but with some partial occlusions and a wide range of variations in pose, appearance, clothing, illumination and background.

    probabilities to be distinguished more easily. We will often use miss rate at 10^-4 FPPW as a reference point for results. This is arbitrary but no more so than, e.g., Area Under ROC. In a multiscale detector it corresponds to a raw error rate of about 0.8 false positives per 640x480 image tested. (The full detector has an even lower false positive rate owing to non-maximum suppression.) Our DET curves are usually quite shallow so even very small improvements in miss rate are equivalent to large gains in FPPW at constant miss rate. For example, for our default detector at 10^-4 FPPW, every 1% absolute (9% relative) reduction in miss rate is equivalent to reducing the FPPW at constant miss rate by a factor of 1.57.

    5 Overview of Results

    Before presenting our detailed implementation and performance analysis, we compare the overall performance of our final HOG detectors with that of some other existing methods. Detectors based on rectangular (R-HOG) or circular log-polar (C-HOG) blocks and linear or kernel SVM are compared with our implementations of the Haar wavelet, PCA-SIFT, and shape context approaches. Briefly, these approaches are as follows.

    Generalized Haar Wavelets. This is an extended set of oriented Haar-like wavelets similar to (but better than) that used in [17]. The features are rectified responses from 9x9 and 12x12 oriented 1st and 2nd derivative box filters at 45° intervals and the corresponding 2nd derivative xy filter.

    PCA-SIFT. These descriptors are based on projecting gradient images onto a basis learned from training images using PCA [11]. Ke & Sukthankar found that they outperformed SIFT for key point based matching, but this is controversial [14]. Our implementation uses 16x16 blocks with the same derivative scale, overlap, etc., settings as our HOG descriptors. The PCA basis is calculated using positive training images.

    Shape Contexts. The original Shape Contexts [1] used binary edge-presence voting into log-polar spaced bins, irrespective of edge orientation. We simulate this using our C-HOG descriptor (see below) with just 1 orientation bin. 16 angular and 3 radial intervals with inner radius 2 pixels and outer radius 8 pixels gave the best results. Both gradient-strength and edge-presence based voting were tested, with the edge threshold chosen automatically to maximize detection performance (the values selected were somewhat variable, in the region of 20-50 graylevels).

    Results. Fig. 3 shows the performance of the various detectors on the MIT and INRIA data sets. The HOG-based detectors greatly outperform the wavelet, PCA-SIFT and Shape Context ones, giving near-perfect separation on the MIT test set and at least an order of magnitude reduction in FPPW on the INRIA one. Our Haar-like wavelets outperform MIT wavelets because we also use 2nd order derivatives and contrast normalize the output vector. Fig. 3(a) also shows MIT's best parts based and monolithic detectors (the points are interpolated from [17]), however beware that an exact comparison is not possible as we do not know how the database in [17] was divided into training and test parts and the negative images used are not available. The performances of the final rectangular (R-HOG) and circular (C-HOG) detectors are very similar, with C-HOG having the slight edge. Augmenting R-HOG with primitive bar detectors (oriented 2nd derivatives, R2-HOG) doubles the feature dimension but further improves the performance (by 2% at 10^-4 FPPW). Replacing the linear SVM with a Gaussian kernel one improves performance by about 3% at 10^-4 FPPW, at the cost of much higher run times[1]. Using binary edge voting (EC-HOG) instead of gradient magnitude weighted voting (C-HOG) decreases performance by 5% at 10^-4 FPPW, while omitting orientation information decreases it by much more, even if additional spatial or radial bins are added (by 33% at 10^-4 FPPW, for both edges (E-ShapeC) and gradients (G-ShapeC)). PCA-SIFT also performs poorly. One reason is that, in comparison to [11], many more (80 of 512) principal vectors have to be retained to capture the same proportion of the variance. This may be because the spatial registration is weaker when there is no keypoint detector.

    6 Implementation and Performance Study

    We now give details of our HOG implementations and systematically study the effects of the various choices on detector performance.

    [1] We use the hard examples generated by linear R-HOG to train the kernel R-HOG detector, as kernel R-HOG generates so few false positives that its hard example set is too sparse to improve the generalization significantly.

    pos = { ... ... }

    neg = { ... background patches ... }

    Descriptor Cues

    (figure panel labels: input image, avg. grad, weighted pos wts, weighted neg wts, outside, in block)

    - The most important cues are the head, shoulder, and leg silhouettes
    - Vertical gradients inside the person count as negative
    - Overlapping blocks just outside the contour are the most important

    Histograms of Oriented Gradients for Human Detection p. 11/13

    HOG features, then a linear SVM sliding window detector

    Dalal & Triggs (CVPR05)
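At their core, the HOG features in this pipeline are per-cell histograms of gradient orientation, weighted by gradient magnitude. A minimal sketch of that idea, omitting the block normalization and interpolation of the full Dalal & Triggs descriptor; `hog_cells` is a hypothetical name:

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Minimal HOG-style descriptor: for each cell of `cell` x `cell` pixels,
    a `bins`-bin histogram of unsigned gradient orientation, weighted by
    gradient magnitude. A sketch of the idea, not the exact D&T pipeline."""
    gy, gx = np.gradient(img.astype(float))          # np.gradient: axis0 (rows) first
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0     # unsigned orientation in [0, 180)
    H, W = img.shape
    ch, cw = H // cell, W // cell
    hist = np.zeros((ch, cw, bins))
    for i in range(ch * cell):
        for j in range(cw * cell):
            b = int(ang[i, j] / (180.0 / bins)) % bins
            hist[i // cell, j // cell, b] += mag[i, j]
    return hist
```

A vertical edge puts all of its mass in the bin containing 0°, while a horizontal edge lands in the bin containing 90°, which is the orientation selectivity the silhouette cues above depend on.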

  • Sliding window detection

    - Compute HOG of the whole image at multiple resolutions
    - Score every subwindow of the feature pyramid
    - Apply non-maximum suppression
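The scoring and suppression steps on this slide can be sketched as follows, assuming features are already computed into a pyramid of HxWxD arrays; `detect`, `nms`, and `iou` are hypothetical helper names, and the loops are the naive per-window version rather than an optimized convolution:

```python
import numpy as np

def detect(feature_pyramid, w, b=0.0, thresh=0.0):
    """Score a linear template w (shape th x tw x d) at every position of
    every pyramid level; return (score, level, row, col) tuples above
    thresh, best first."""
    th, tw, _ = w.shape
    hits = []
    for lvl, F in enumerate(feature_pyramid):
        H, W, _ = F.shape
        for i in range(H - th + 1):
            for j in range(W - tw + 1):
                s = float(np.sum(F[i:i + th, j:j + tw] * w)) + b
                if s > thresh:
                    hits.append((s, lvl, i, j))
    return sorted(hits, reverse=True)

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, overlap=0.5):
    """Greedy non-maximum suppression on (score, x1, y1, x2, y2) tuples:
    keep a box only if it overlaps every kept box by at most `overlap`."""
    keep = []
    for box in sorted(boxes, reverse=True):
        if all(iou(box[1:], k[1:]) <= overlap for k in keep):
            keep.append(box)
    return keep
```

In practice the double loop is a cross-correlation of each pyramid level with the template, which is why the detector stays efficient despite scoring every subwindow.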


    Image pyramid HOG feature pyramid

    score(I, p) = w · φ(I, p)

  • Detection

    p: a window position in the feature pyramid; number of locations ~ 250,000 per image

    test set has ~ 5,000 images

    >> 1.3x10^9 windows to classify

    typically only ~ 1,000 true positive locations

    Extremely unbalanced binary classification
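A standard response to this imbalance, and the strategy the DPM papers build their training on, is to mine hard negatives: train on a small cache, then scan the huge negative set and add only the windows the current model gets wrong or nearly wrong. A sketch under simplifying assumptions (a subgradient-descent hinge-loss trainer stands in for a real SVM solver, windows are pre-extracted feature vectors, and all names are hypothetical):

```python
import numpy as np

def mine_hard_negatives(w, neg_windows, margin=1.0):
    """Keep only negatives the current model scores above -margin:
    the misclassified or nearly misclassified ones."""
    return neg_windows[neg_windows @ w > -margin]

def train_with_mining(pos, neg_pool, rounds=5, epochs=100, lr=0.05, lam=1e-3):
    """Alternate between training on a small cache and growing the cache
    with hard negatives from the full pool."""
    w = np.zeros(pos.shape[1])
    cache = neg_pool[:len(pos)].copy()       # start from a few easy negatives
    for _ in range(rounds):
        X = np.vstack([pos, cache])
        y = np.hstack([np.ones(len(pos)), -np.ones(len(cache))])
        for _ in range(epochs):
            viol = y * (X @ w) < 1           # margin violations
            g = lam * w                      # hinge-loss subgradient step
            if viol.any():
                g = g - (y[viol][:, None] * X[viol]).mean(axis=0)
            w -= lr * g
        hard = mine_hard_negatives(w, neg_pool)
        if len(hard):
            cache = np.unique(np.vstack([cache, hard]), axis=0)
    return w
```

The point is that the solver only ever touches the positives plus the mined cache, which stays tiny compared to the billions of candidate negative windows.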

  • Dalal & Triggs detector on INRIA

    (precision-recall plots; legend, static set: Ker. RHOG, Lin. RHOG, Lin. R2Hog, Wavelet, PCASIFT, Lin. EShapeC; legend, static+moving set: RHOG + IMHmd, RHOG, Wavelet)

    Fig. 3.6. The performance of selected detectors on the INRIA static (left) and static+moving (right) person data sets. For both of the data sets, the plots show the substantial overall gains obtained by using HOG features rather than other state-of-the-art descriptors. (a) Compares static HOG descriptors with other state-of-the-art descriptors on the INRIA static person data set. (b) Compares the combined static and motion HOG, the static HOG and the wavelet detectors on the combined INRIA static and moving person data set.

    [2001] but also includes both 1st and 2nd-order derivative filters at 45° intervals and the corresponding 2nd derivative xy filter. It yields an AP of 0.53. Shape contexts based on edges (E-ShapeC) perform considerably worse with an AP of 0.25. However, Chapter 4 will show that generalised shape contexts [Mori and Malik 2003], which like standard shape contexts compute circular blocks with cells shaped over a log-polar grid, but which use both image gradients and orientation histograms as in R-HOG, give similar performance. This highlights the fact that orientation histograms are very effective at capturing the information needed for object recognition.

    For the video sequences we compare our combined static and motion HOG, static HOG, and Haar wavelet detectors. The detectors were trained and tested on training and test portions of the combined INRIA static and moving person data set. Details on how the descriptors and the data sets were combined are presented in Chapter 6. Figure 3.6(b) summarises the results. The HOG-based detectors again significantly outperform the wavelet based one, but surprisingly the combined static and motion HOG detector does not seem to offer a significant advantage over the static HOG one: the static detector gives an AP of 0.553 compared to 0.527 for the motion detector. These results are surprising and disappointing because Sect. 6.5.2, where we used DET curves (cf. Sect. B.1) for evaluations, shows that for exactly the same data set, the individual window classifier for the motion detector gives significantly better performance than the static HOG window classifier, with false positive rates about one order of magnitude lower than those for the static HOG classifier. We are not sure what is causing this anomaly and are currently investigating it. It seems to be linked to the threshold used for truncating the scores in the mean shift fusion stage (during non-maximum suppression) of the combined detector.

    AP = 75% (79% in my implementation)

    Very good! Declare victory and go home?

  • Dalal & Triggs on PASCAL VOC 2007

    AP = 12% (using my implementation)


  • How can we do better?

    Revisit an old idea: part-based models (pictorial structures)
    - Fischler & Elschlager '73
    - Felzenszwalb & Huttenlocher '00

    Combine with modern features and machine learning

  • Part-based models

    Parts: local appearance templates. Springs: spatial connections between parts (geometric prior).

    Image: [Felzenszwalb and Huttenlocher 05]

  • Part-based models

    Local appearance is easier to model than global appearance
    - Training data is shared across deformations
    - A part can be local or global depending on resolution

    Generalizes to previously unseen configurations

  • General formulation

    G = (V, E): graph of parts (v_1, v_2, . . .) with spring edges

    z = (p_1, . . . , p_n)

    p_1, . . . , p_n: part locations in the image (or feature pyramid)

  • Part configuration score function

    score(p_1, . . . , p_n) = Σ_{i=1..n} m_i(p_i) − Σ_{(i,j)∈E} d_ij(p_i, p_j)

    m_i(p_i): part match scores

    d_ij(p_i, p_j): spring costs

    (figure: highest scoring configurations for parts v_1, v_2)

  • Part configuration score function

    score(p_1, . . . , p_n) = Σ_{i=1..n} m_i(p_i) − Σ_{(i,j)∈E} d_ij(p_i, p_j)

    m_i(p_i): part match scores; d_ij(p_i, p_j): spring costs

    Objective: maximize the score over p_1, . . . , p_n
    - h^n configurations! (h = |P|, about 250,000)
    - Dynamic programming: if G = (V, E) is a tree, O(nh^2) general algorithm; O(nh) with some restrictions on d_ij
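In the star-shaped special case DPM uses (every part attached to a root), the tree DP decomposes: for each root location, each part contributes an independent max over its own placements. A naive O(n·h^2) sketch of that computation on a toy 1-D "image", with hypothetical names; the real system replaces the inner max with an O(h) generalized distance transform:

```python
import numpy as np

def star_model_scores(root_scores, part_scores, spring_cost):
    """Naive O(n * h^2) DP for a star-structured model:
    overall(p0) = m0(p0) + sum_i max_pi [ m_i(p_i) - d_i(p_i, p0) ].
    root_scores: (h,) match scores for the root at each location.
    part_scores: list of (h,) match-score arrays, one per part.
    spring_cost(pi, p0): deformation cost of placing a part at pi
    when the root is at p0."""
    h = len(root_scores)
    total = np.array(root_scores, dtype=float)
    for m in part_scores:
        for p0 in range(h):
            # best placement of this part, given the root at p0
            total[p0] += max(m[pi] - spring_cost(pi, p0) for pi in range(h))
    return total

# toy setup: 5 locations, quadratic spring, one part matching best at location 3
root = np.array([0.0, 1.0, 5.0, 1.0, 0.0])
part = [np.array([0.0, 0.0, 0.0, 4.0, 0.0])]
scores = star_model_scores(root, part, lambda pi, p0: (pi - p0) ** 2)
print(int(scores.argmax()))  # -> 2: the part deforms one step to help the root
```

Because each part's max is independent given the root, the total work is linear in the number of parts, which is what makes the star restriction attractive.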

  • Star-structured deformable part models

    test image star model detection

    root part

  • Recall the Dalal & Triggs detector

    - HOG feature pyramid
    - Linear filter / sliding-window detector
    - SVM training to learn parameters w


    Image pyramid HOG feature pyramid

    score(I, p) = w · φ(I, p)

  • D&T + parts

    Add parts to the Dalal & Triggs detector:
    - HOG features
    - Linear filters / sliding-window detector
    - Discriminative training


[FMR CVPR08] [FGMR PAMI10]

(figure: image pyramid and HOG feature pyramid, with root location p0 and part placements z)

  • Sliding window DPM score function

A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb, University of Chicago

David McAllester, Toyota Technological Institute at Chicago

Deva Ramanan, UC Irvine

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.

Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.

This material is based upon work supported by the National Science Foundation under Grants No. 0534820 and 0535174.

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.

The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1-3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by conceptually weaker models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining hard negative examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a


(figure: image pyramid and HOG feature pyramid; root location p0, part placements z)

z = (p1, ..., pn)

score(H, p0) = max_{p1,...,pn} [ Σ_{i=0..n} m_i(H, p_i) - Σ_{i=1..n} cost_i(p0, p_i) ]

(filter scores m_i, spring costs cost_i)

  • Detection in a slide

(figure: detection pipeline for a test image)

compute a feature map for the test image, and a feature map at 2x resolution

response of root filter: cross-correlate the root filter with the feature map

responses of part filters: cross-correlate the 1-st through n-th part filters with the 2x-resolution feature map

transformed responses: spread each part response over nearby locations, paying the spring cost (a generalized distance transform)

detection scores for each root location: sum the root filter response and the transformed part responses

score(H, p0) = max_z w · Φ(H, (p0, z))
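The pipeline above can be sketched directly; a toy numpy version (an illustration only, not the voc-release5 implementation: real DPM uses HOG features, per-part anchor offsets, and a linear-time generalized distance transform instead of the brute-force deform_max here):

```python
import numpy as np

def correlate(feat, filt):
    """Valid-mode cross-correlation of a 2-D feature map with a filter."""
    H, W = feat.shape
    h, w = filt.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(feat[y:y + h, x:x + w] * filt)
    return out

def deform_max(resp, dcost):
    """Brute-force 'transformed response':
    out[y, x] = max_{y', x'} resp[y', x'] - dcost * ((y - y')**2 + (x - x')**2)."""
    H, W = resp.shape
    ys, xs = np.mgrid[0:H, 0:W]
    out = np.empty_like(resp)
    for y in range(H):
        for x in range(W):
            out[y, x] = np.max(resp - dcost * ((ys - y) ** 2 + (xs - x) ** 2))
    return out

rng = np.random.default_rng(0)
feat = rng.standard_normal((12, 12))   # stand-in for one HOG pyramid level
root = rng.standard_normal((4, 4))     # root filter
part = rng.standard_normal((2, 2))     # one part filter

root_resp = correlate(feat, root)                   # response of root filter
part_resp = deform_max(correlate(feat, part), 0.1)  # transformed part response
# detection score at each root location (anchor offsets omitted for brevity)
score = root_resp + part_resp[:root_resp.shape[0], :root_resp.shape[1]]
```

With dcost = 0 every location inherits the global maximum (a bag of parts); with a very large dcost the part cannot move at all (a rigid template), which mirrors the si discussion later in the results.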

  • What are the parts?

  • Aspect soup

    General philosophy: enrich models to better represent the data

aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv

Our rank: 3 1 2 1 1 2 2 4 1 1 1 4 2 2 1 1 2 1 4 1
Our score: .180 .411 .092 .098 .249 .349 .396 .110 .155 .165 .110 .062 .301 .337 .267 .140 .141 .156 .206 .336
Darmstadt: .301
INRIA Normal: .092 .246 .012 .002 .068 .197 .265 .018 .097 .039 .017 .016 .225 .153 .121 .093 .002 .102 .157 .242
INRIA Plus: .136 .287 .041 .025 .077 .279 .294 .132 .106 .127 .067 .071 .335 .249 .092 .072 .011 .092 .242 .275
IRISA: .281 .318 .026 .097 .119 .289 .227 .221 .175 .253
MPI Center: .060 .110 .028 .031 .000 .164 .172 .208 .002 .044 .049 .141 .198 .170 .091 .004 .091 .034 .237 .051
MPI ESSOL: .152 .157 .098 .016 .001 .186 .120 .240 .007 .061 .098 .162 .034 .208 .117 .002 .046 .147 .110 .054
Oxford: .262 .409 .393 .432 .375 .334
TKK: .186 .078 .043 .072 .002 .116 .184 .050 .028 .100 .086 .126 .186 .135 .061 .019 .036 .058 .067 .090

Table 1. PASCAL VOC 2007 results. Average precision scores of our system and other systems that entered the competition [7]. Empty boxes indicate that a method was not tested in the corresponding class. The best score in each class is shown in bold. Our current system ranks first in 10 out of 20 classes. A preliminary version of our system ranked first in 6 classes in the official competition.

Bottle, Car, Bicycle, Sofa

Figure 4. Some models learned from the PASCAL VOC 2007 dataset. We show the total energy in each orientation of the HOG cells in the root and part filters, with the part filters placed at the center of the allowable displacements. We also show the spatial model for each part, where bright values represent cheap placements, and dark values represent expensive placements.

in the PASCAL competition was .16, obtained using a rigid template model of HOG features [5]. The best previous result of .19 adds a segmentation-based verification step [20]. Figure 6 summarizes the performance of several models we trained. Our root-only model is equivalent to the model from [5] and it scores slightly higher at .18. Performance jumps to .24 when the model is trained with a LSVM that selects a latent position and scale for each positive example. This suggests LSVMs are useful even for rigid templates because they allow for self-adjustment of the detection window in the training examples. Adding deformable parts increases performance to .34 AP, a factor of two above the best previous score. Finally, we trained a model with parts but no root filter and obtained .29 AP. This illustrates the advantage of using a multiscale representation.

We also investigated the effect of the spatial model and allowable deformations on the 2006 person dataset. Recall that si is the allowable displacement of a part, measured in HOG cells. We trained a rigid model with high-resolution parts by setting si to 0. This model outperforms the root-only system by .27 to .24. If we increase the amount of allowable displacements without using a deformation cost, we start to approach a bag-of-features. Performance peaks at si = 1, suggesting it is useful to constrain the part displacements. The optimal strategy allows for larger displacements while using an explicit deformation cost.



  • Mixture models

    Data driven: aspect, occlusion modes, subclasses

FMR CVPR 08: AP = 0.27 (person)
FGMR PAMI 10: AP = 0.36 (person)

    (a) Car component 1 (initial parts)

    (b) Car component 1 (trained parts)

    (c) Car component 2 (initial parts)

    (d) Car component 2 (trained parts)

    (e) Car component 3 (initial parts)

    (f) Car component 3 (trained parts)

Figure 4.3: Car components with parts initialized by interpolating the root filter to twice its resolution (a,c,e), and parts after training with LSVM or WL-SSVM (b,d,f).



  • Pushmi-pullyu?

    Good generalization properties on Doctor Dolittle's farm

    This was supposed to detect horses

    (figure: averaging a left-facing and a right-facing horse template, ( + ) / 2, yields a two-headed model)

  • Latent orientation

Unsupervised left/right orientation discovery

FGMR PAMI 10: AP = 0.36 (person)
voc-release5: AP = 0.45 (person)
Publicly available code for the whole system: current voc-release5

(figure: horse AP values 0.42, 0.47, 0.57)

  • Summary of results



[DT05] AP 0.12
[FMR08] AP 0.27
[FGMR10] AP 0.36
[GFM voc-release5] AP 0.45
[GFM11] AP 0.49

  • Part 2: DPM parameter learning

    (figure: fixed model structure, components 1 and 2, with unknown filter weights marked "?")

    fixed model structure

  • Part 2: DPM parameter learning

    fixed model structure; training images with labels y = +1

  • Part 2: DPM parameter learning

    fixed model structure; training images with labels y = +1 and y = -1

  • Part 2: DPM parameter learning

    fixed model structure; training images with labels y = +1 and y = -1

    Parameters to learn: biases (per component), deformation costs (per part), filter weights

  • Linear parameterization

z = (p1, ..., pn)

score(H, p0) = max_{p1,...,pn} [ Σ_{i=0..n} m_i(H, p_i) - Σ_{i=1..n} cost_i(p0, p_i) ]

Filter scores: m_i(H, p_i) = w_i · φ(H, p_i)

Spring costs: cost_i(p0, p_i) = d_i · (dx_i, dy_i, dx_i², dy_i²)

Collecting the filters w_i and deformation parameters d_i into a single vector w:

score(H, p0) = max_z w · Φ(H, (p0, z))
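The linear parameterization can be checked numerically; a minimal sketch with hypothetical toy dimensions (8-dimensional "filters" standing in for flattened HOG templates):

```python
import numpy as np

rng = np.random.default_rng(1)
w_root = rng.standard_normal(8)      # flattened root filter
w_part = rng.standard_normal(8)      # flattened part filter
d = np.array([0.1, 0.2, 0.5, 0.5])   # deformation parameters for (dx, dy, dx^2, dy^2)

def Phi(f_root, f_part, dx, dy):
    """Phi(H, (p0, z)): stacked filter windows plus negated displacement
    features, so that w . Phi = filter scores minus spring costs."""
    return np.concatenate([f_root, f_part, [-dx, -dy, -dx * dx, -dy * dy]])

w = np.concatenate([w_root, w_part, d])

f_root = rng.standard_normal(8)      # features under the root at p0
f_part = rng.standard_normal(8)      # features under the part at p1
dx, dy = 2.0, -1.0                   # displacement of p1 from its anchor

linear = w @ Phi(f_root, f_part, dx, dy)
manual = w_root @ f_root + w_part @ f_part - d @ np.array([dx, dy, dx * dx, dy * dy])
```

The two quantities agree by construction; this is exactly the rewriting that lets detection and learning share one weight vector w.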

  • Positive examples (y = +1)

We want f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z) to score ≥ +1

Z(x) includes all z with more than 70% overlap with ground truth

x specifies an image and a bounding box

(figure: a "person" training example)

  • Negative examples (y = -1)

x specifies an image and a HOG pyramid location p0

We want f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z) to score ≤ -1

  • Typical dataset

300 to 8,000 positive examples

500 million to 1 billion negative examples (not including latent configurations!)

    Large-scale*

    *unless someone from google is here

  • How we learn parameters: latent SVM

E(w) = (1/2)||w||² + C Σ_i max{0, 1 - y_i f_w(x_i)}

E(w) = (1/2)||w||² + C Σ_i max{0, 1 - y_i f_w(x_i)}

E(w) = (1/2)||w||² + C Σ_{i ∈ Pos} max{0, 1 - max_{z ∈ Z(x_i)} w · Φ(x_i, z)}
                   + C Σ_{i ∈ Neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

    How we learn parameters: latent SVM

E(w) = (1/2)||w||² + C Σ_i max{0, 1 - y_i f_w(x_i)}

E(w) = (1/2)||w||² + C Σ_{i ∈ Pos} max{0, 1 - max_{z ∈ Z(x_i)} w · Φ(x_i, z)}
                   + C Σ_{i ∈ Neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

(figure: max_{z} w · Φ(x_i, z) is a max of linear functions of w, one per latent value z1, ..., z4, hence the negative-example term is convex)

    How we learn parameters: latent SVM

E(w) = (1/2)||w||² + C Σ_i max{0, 1 - y_i f_w(x_i)}

E(w) = (1/2)||w||² + C Σ_{i ∈ Pos} max{0, 1 - max_{z ∈ Z(x_i)} w · Φ(x_i, z)}
                   + C Σ_{i ∈ Neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

(figure: the positive-example loss term is concave in w :( , the negative-example term is convex)

    How we learn parameters: latent SVM

  • Observations

(figure: the positive-example loss term is concave in w, the negative-example term is convex)

Latent SVM objective is convex in the negatives, but not in the positives

>> semi-convex

  • Convex upper bound on loss

(figure: fixing the latent value of a positive example at z_i = argmax_z w(current) · Φ(x_i, z), here z2, replaces the concave loss with a convex upper bound that touches it at the current w)

max{0, 1 - max_{z ∈ Z(x_i)} w · Φ(x_i, z)} ≤ max{0, 1 - w · Φ(x_i, z_i)}   (convex in w)
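The bound follows in two steps, using only that max{0, ·} is nondecreasing:

```latex
\max_{z \in Z(x_i)} w \cdot \Phi(x_i, z) \;\ge\; w \cdot \Phi(x_i, z_i)
\;\Longrightarrow\;
1 - \max_{z \in Z(x_i)} w \cdot \Phi(x_i, z) \;\le\; 1 - w \cdot \Phi(x_i, z_i)
\;\Longrightarrow\;
\max\bigl\{0,\; 1 - \max_{z \in Z(x_i)} w \cdot \Phi(x_i, z)\bigr\}
\;\le\; \max\bigl\{0,\; 1 - w \cdot \Phi(x_i, z_i)\bigr\}
```

The right-hand side is a hinge on a linear function of w, hence convex, and the bound is tight at the w used to pick z_i.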

  • Auxiliary objective

Let Z_P = {z_1, z_2, ...} (one fixed latent value per positive example)

E(w, Z_P) = (1/2)||w||² + C Σ_{i ∈ Pos} max{0, 1 - w · Φ(x_i, z_i)}
                        + C Σ_{i ∈ Neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

  • Auxiliary objective

Let Z_P = {z_1, z_2, ...} (one fixed latent value per positive example)

E(w, Z_P) = (1/2)||w||² + C Σ_{i ∈ Pos} max{0, 1 - w · Φ(x_i, z_i)}
                        + C Σ_{i ∈ Neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Note that E(w, Z_P) ≥ min_{Z_P} E(w, Z_P) = E(w)

  • Auxiliary objective

Let Z_P = {z_1, z_2, ...} (one fixed latent value per positive example)

E(w, Z_P) = (1/2)||w||² + C Σ_{i ∈ Pos} max{0, 1 - w · Φ(x_i, z_i)}
                        + C Σ_{i ∈ Neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Note that E(w, Z_P) ≥ min_{Z_P} E(w, Z_P) = E(w)

and w* = argmin_{w, Z_P} E(w, Z_P) achieves min_w E(w)

  • Auxiliary objective

w* = argmin_{w, Z_P} E(w, Z_P) achieves min_w E(w)

This isn't any easier to optimize

  • Auxiliary objective

w* = argmin_{w, Z_P} E(w, Z_P) achieves min_w E(w)

This isn't any easier to optimize

Find a stationary point by coordinate descent on E(w, Z_P)

  • Auxiliary objective

w* = argmin_{w, Z_P} E(w, Z_P) achieves min_w E(w)

This isn't any easier to optimize

Find a stationary point by coordinate descent on E(w, Z_P)

Initialization: pick an initial w(0) (or an initial Z_P)

  • Auxiliary objective

w* = argmin_{w, Z_P} E(w, Z_P) achieves min_w E(w)

This isn't any easier to optimize

Find a stationary point by coordinate descent on E(w, Z_P)

Initialization: pick an initial w(0) (or an initial Z_P)

Step 1: z_i := argmax_{z ∈ Z(x_i)} w(t) · Φ(x_i, z), for all i ∈ Pos

  • Auxiliary objective

w* = argmin_{w, Z_P} E(w, Z_P) achieves min_w E(w)

This isn't any easier to optimize

Find a stationary point by coordinate descent on E(w, Z_P)

Initialization: pick an initial w(0) (or an initial Z_P)

Step 1: z_i := argmax_{z ∈ Z(x_i)} w(t) · Φ(x_i, z), for all i ∈ Pos

Step 2: w(t+1) := argmin_w E(w, Z_P)
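The two steps can be demonstrated end to end on a toy problem; a sketch with hypothetical 2-D data, where Step 2 is solved approximately by subgradient descent rather than the working-set solver described below:

```python
import numpy as np

def lsvm_train(pos_bags, neg_bags, C=1.0, rounds=5, epochs=200, lr=0.01):
    """Toy latent SVM via coordinate descent.
    Each example is a 'bag' of candidate feature vectors Phi(x, z), one row
    per latent value z, so that f_w(x) = max_z w . Phi(x, z)."""
    w = np.zeros(pos_bags[0].shape[1])
    for _ in range(rounds):
        # Step 1: fix the latent value of each positive at its current best
        zs = [bag[np.argmax(bag @ w)] for bag in pos_bags]
        # Step 2: approximately minimize the now-convex objective E(w, Z_P)
        for _ in range(epochs):
            g = w.copy()                      # gradient of (1/2)||w||^2
            for x in zs:                      # positives: max{0, 1 - w.x}
                if w @ x < 1:
                    g -= C * x
            for bag in neg_bags:              # negatives: max{0, 1 + max_z w.Phi}
                x = bag[np.argmax(bag @ w)]
                if w @ x > -1:
                    g += C * x
            w -= lr * g
    return w

# hypothetical data: each row of a bag is Phi(x, z) for one latent value z
pos_bags = [np.array([[2.0, 0.0], [0.0, 0.2]]),
            np.array([[2.2, 0.1], [0.1, 0.0]])]
neg_bags = [np.array([[-2.0, 0.0], [-1.5, 0.3]])]
w = lsvm_train(pos_bags, neg_bags)
```

After training, the best latent instance of each positive bag scores positive and every instance of the negative bag scores negative, which is the separation the objective asks for.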

  • Step 1

This is just detection:

Step 1: z_i := argmax_{z ∈ Z(x_i)} w(t) · Φ(x_i, z), for all i ∈ Pos

(figure: the detection pipeline from Part 1: root and part filter responses, transformed responses, summed detection scores for each root location)

  • Step 2

min_w (1/2)||w||² + C Σ_{i ∈ Pos} max{0, 1 - w · Φ(x_i, z_i)}
                  + C Σ_{i ∈ Neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Convex

  • Step 2

min_w (1/2)||w||² + C Σ_{i ∈ Pos} max{0, 1 - w · Φ(x_i, z_i)}
                  + C Σ_{i ∈ Neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Convex

Similar to a structural SVM

  • Step 2

min_w (1/2)||w||² + C Σ_{i ∈ Pos} max{0, 1 - w · Φ(x_i, z_i)}
                  + C Σ_{i ∈ Neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Convex

Similar to a structural SVM

But, recall 500 million to 1 billion negative examples!

  • Step 2

min_w (1/2)||w||² + C Σ_{i ∈ Pos} max{0, 1 - w · Φ(x_i, z_i)}
                  + C Σ_{i ∈ Neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Convex

Similar to a structural SVM

But, recall 500 million to 1 billion negative examples!

Can be solved by a working set method (a.k.a. bootstrapping, data mining, constraint generation); requires a bit of engineering to make this fast
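The core of the working-set idea can be sketched in a few lines (a simplified mining step, not the actual voc-release5 cache logic):

```python
import numpy as np

def mine_hard_negatives(w, neg_pool, threshold=-1.0):
    """Keep only the negatives that violate the margin under the current w,
    i.e. those scoring w . x >= -1; the rest contribute zero hinge loss and
    can be safely left out of the working set."""
    return [x for x in neg_pool if w @ x >= threshold]

# toy check with a fixed model w
w = np.array([1.0, 0.0])
neg_pool = [np.array([-2.0, 0.0]),   # score -2.0: easy, dropped
            np.array([-0.9, 0.4]),   # score -0.9: hard, kept
            np.array([0.5, 0.0])]    # score  0.5: hard, kept
hard = mine_hard_negatives(w, neg_pool)
```

In the full algorithm one alternates: train on the current working set, then grow it with newly mined hard negatives and shrink away the now-easy ones; this reaches the same solution as training on all negatives while only ever holding a small cache in memory.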

  • Comments

Latent SVM is mathematically equivalent to MI-SVM (Andrews et al. NIPS 2003)

Latent SVM can be written as a latent structural SVM (Yu and Joachims ICML 2009); there the natural optimization algorithm is the concave-convex procedure, which is similar to, but not exactly the same as, coordinate descent

(figure: example x_i as a bag of instances x_i1, x_i2, x_i3 with latent labels z1, z2, z3)

  • What about the model structure?

(figure: fixed model structure, components 1 and 2, with unknown filter weights marked "?"; training images with labels y = +1 and y = -1)

Model structure: # components; # parts per component; root and part filter shapes; part anchor locations

  • Learning model structure

    Split positives by aspect ratio

    Warp to common size

    Train Dalal & Triggs model for each aspect ratio on its own
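The split step might look like the following sketch (the (x1, y1, x2, y2) box format and the equal-size split rule are assumptions for illustration):

```python
def split_by_aspect(boxes, n_components=2):
    """Sort positive bounding boxes by aspect ratio and split them into
    equal-size groups, one group per mixture component."""
    boxes = sorted(boxes, key=lambda b: (b[2] - b[0]) / (b[3] - b[1]))
    k, r = divmod(len(boxes), n_components)
    groups, i = [], 0
    for c in range(n_components):
        size = k + (1 if c < r else 0)   # spread any remainder over early groups
        groups.append(boxes[i:i + size])
        i += size
    return groups

# four boxes with aspect ratios 0.5, 1.0, 2.0, 3.0
boxes = [(0, 0, 1, 2), (0, 0, 1, 1), (0, 0, 2, 1), (0, 0, 3, 1)]
groups = split_by_aspect(boxes, n_components=2)
```

Each group is then warped to a common size and used to train its own Dalal & Triggs root filter.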

  • Learning model structure

    Use D&T filters as initial w for LSVM training

    Merge components

    Root filter placement and component choice are latent

  • Learning model structure

    Add parts to cover high-energy areas of root filters

    Continue training model with LSVM

  • Learning model structure

    without orientation clustering

    with orientation clustering

  • Learning model structure

In summary: repeated application of LSVM training to models of increasing complexity; structure learning involves many heuristics (and vision insight!)