Video Object Detection Speedup Using Staggered Sampling

Darryl Greig
Hewlett Packard Labs, Long Down Ave, Bristol BS34 8QZ, UK
[email protected]

978-1-4244-5498-3/09/$25.00 ©2009 IEEE

Abstract

This paper presents an enhancement of the standard sampling strategy for filter-based object detection and tracking in video streams. The proposed method, called staggered sampling, seeks to maximize the sampling density across video frames, reducing the number of patches sampled while retaining proportionally high recall rates. The method can be tailored to virtually any constraint on resources and may be used in conjunction with any filter-based object detector / tracking algorithm combination. We test our method using a modified version of the face detector in the OpenCV library and a simple tracking algorithm. The resulting detector was applied to video sequences from the QCIF collection. Our results show that staggered sampling can achieve around 90% of the recall of full (dense) sampling while only evaluating the detector on around 10% of the image locations. At the same time the precision of the detector increases. The staggered sampling approach therefore addresses the problem of acquiring new objects in an object tracking framework by enabling a low-cost background scan of the video stream to run continuously. The simplicity and robustness of this approach make it an excellent enhancement to existing video object detection methods.

1. Introduction

There have been many recent advances in object detection algorithms leading to real-time implementations on a wide variety of platforms. In some cases real-time performance is achieved by coupling an accurate but computationally expensive object detector with some sort of fast and robust tracking algorithm ([7]–[11]). Such systems often require elaborate (or expensive) initialization procedures and also have difficulty picking up new objects entering the scene. In other cases the object detection algorithm itself is sufficiently efficient to allow real-time performance (e.g. [1]), possibly through the use of parallel architectures.

In our experience, object detection algorithms are normally used as a precursor to some other kind of processing (e.g. image enhancement, parameter setting, image capture, face recognition, scene decomposition and understanding) which may run concurrently on the same platform. Particularly on low-power platforms, or when dealing with higher resolution video streams, it is critical that the object detection procedure does not consume too much of the platform resources, in order to release resources for the main applications. Although resource availability may vary over time, the detection process must maintain real-time rates or unacceptable performance degradation will result.

Many object detection methods (e.g. [1]–[5]) are based on applying a classification algorithm to image patches at locations and scales determined by a sampling strategy. As far as we are aware, such methods all employ a sampling strategy consisting of a spatial grid at multiple scales over each frame, with the vertical and horizontal grid steps given by a constant for each grid scale. The same sampling is applied to consecutive frames in a video sequence. For the purposes of this paper we shall presume that the “native” sampling strategy of a detector is the finest resolution sampling, and we refer to it as 1x1x1, corresponding to the x-step, the y-step and the scale step.

The simplest way to reduce the resources required by object detection is to coarsen the sampling, either by increasing the spatial grid step, or reducing the number of scales sampled, or both. If the object detection algorithm is reasonably robust to changes in scale and location then this can result in significant efficiency gains for proportionally much smaller degradation in the detection accuracy. However, it turns out that for the same computational cost we can achieve better detection accuracy by implementing a slightly more sophisticated sampling strategy.

The main contribution of this paper is a sequence of simple adjustments, called staggered sampling, which may be made to the conventional sampling strategy to yield higher detection accuracy for the same computational cost. We also give an example of how this may be combined with a simple tracking algorithm to give an interleaved detection and tracking framework.

2. Related Work

This work is most closely related to the field of detecting and tracking objects in video. The simplest approach to this



problem is to create a patch-based detector that is both robust and efficient enough to be applied to every frame of a video stream in real-time (e.g. [1], [3], [6]) using the standard sampling strategy. However no matter how efficient a detector is, there still must be some processing on every location in the sampling grid, which bounds the maximum speed achievable without resorting to parallelization of one form or another. A more serious issue arises from the inherent precision/recall tradeoff in most object detection algorithms, which usually results in objects being lost in some percentage of the frames. One response to this is to couple the object detection algorithm with a tracking algorithm that retains historical knowledge to provide a more robust solution.

In [7] the authors use a face detector together with a modified version of the CONDENSATION tracking algorithm to maintain a probability map in which faces appear as local maxima. The detection algorithm is run concurrently with the tracking to maintain accurate probabilities and to cope with the appearance of new faces. [9] follows a similar strategy, using particle filters rather than CONDENSATION for tracking. A holistic detection and tracking framework is presented in [10], once again relying on frame-by-frame detection which is “primed” by a tracking algorithm to enhance the results. The focus of these approaches is the effective tracking of a detected object, and in many cases the actual algorithm doing the detection is of secondary importance. A different approach is taken in [8], where a conventional face detector is coupled with a new kind of face detector that is reported to perform better on rotated faces. The conventional detector is used in the normal course of running, and the new detector is employed only when a track is lost. A sophisticated tracking algorithm is proposed in [11] but again it is initialized by a separate object detector.

The current work is actually orthogonal to these approaches in that the video sampling techniques may be used as a further enhancement to any object tracking framework that relies on point verification of objects. Indeed, one of the great problems of object detection and tracking is how to maintain a continual scan of the scene for the appearance of new objects without degrading the performance of existing object tracking. By specifying a graded sampling system we may allow a background scan for new objects to consume resources at an acceptable level while returning the best possible detection results.

3. Staggered Sampling

In the first instance, suppose we have a perfect detector for some object population Ω, where by “perfect” we mean that for all occurrences ω ∈ Ω in an image, the detector will give a positive response if and only if it is applied to the exact location of an occurrence. Suppose further that each image contains a single occurrence ω at a location distributed according to a uniform prior. Then the probability of failing to detect the occurrence is simply 1 − (ns/N), where ns is the number of locations in our sampling strategy and N is the total number of locations in the image.

While this detector is “perfect” in some theoretical sense, it clearly isn’t very robust: missing the object location by a single pixel has the same effect as missing it by the maximal dimension of the image. In practice, good object detectors applied to locations close to an object occurrence can often detect the object with reasonably high reliability. We may represent this by enhancing the perfect detector (we’ll refer to this as the robust detector) to give a positive response with probability given by some monotone decreasing function P(δ), where δ is the Euclidean distance of the detection location from the nearest object occurrence location, and let P(0) ≡ 1. If a false positive is defined to be a positive response greater than some distance τ from any object occurrence, then the detector will produce false positives if there exists δ > τ such that P(δ) > 0, and in that sense it is not perfect. However, depending on P it can be rather more useful. Note that real detectors are actually somewhat worse than the robust detector, since they produce false positives independent of the presence of any true object occurrence, and few detectors will detect 100% of occurrences even when positioned at exactly the right location.

3.1. Spatial Sampling of a Single Scale

Consider a video stream recording a single occurrence of a detectable object at a random location in the frame, and suppose that object is stationary throughout the video stream. Using the perfect detector described above and applying the same sampling strategy frame-by-frame results in a detection probability of ns/N. Furthermore if the object is not discovered in the first frame it will not be located in any frame.

However if the sampling strategy is adapted so that in a sequence of m frames no sampling grid location is visited more than once and ns sampling points are visited in each frame (therefore mns ≤ N), then while the computation cost remains the same, the probability of finding the object in the sequence becomes mns/N.
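This contrast can be checked with a short simulation. The sketch below follows the section's assumptions (a perfect detector and one stationary object placed uniformly at random among N locations); the function name and parameters are illustrative:

```python
import random

def detection_prob(N, ns, m, staggered, trials=20000):
    """Estimate the probability of finding a stationary object (placed
    uniformly at random among N locations) within m frames of ns samples,
    using a perfect detector that fires only at the exact object location."""
    hits = 0
    for _ in range(trials):
        obj = random.randrange(N)
        if staggered:
            # m disjoint sets of ns locations each (requires m*ns <= N)
            visited = random.sample(range(N), m * ns)
        else:
            # the same ns locations revisited in every frame
            visited = random.sample(range(N), ns)
        hits += obj in visited
    return hits / trials

random.seed(1)
N, ns, m = 100, 10, 4
p_fixed = detection_prob(N, ns, m, staggered=False)  # close to ns/N = 0.1
p_stag = detection_prob(N, ns, m, staggered=True)    # close to m*ns/N = 0.4
```

With a fixed grid the probability saturates at ns/N no matter how many frames elapse, while the disjoint (staggered) schedule reaches m·ns/N for the same per-frame cost.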

The benefits of this become clear if we devise a sampling strategy in which all N locations in the image are visited in m frames. In that case the probability of detecting the object is 1 and we may compute E(Fd), the expected number of frames before detection as per Equation (1). The computational cost is fixed at N/m locations per frame. Therefore we could tailor a sampling strategy to any constraint on computational resources simply by choosing a suitable m value, so long as we are willing to pay the price of a larger E(Fd).

Page 3: Video Object Detection Speedup Using Staggered Sampling

E(Fd) = Σi=1..m i (ns/N)(1 − ns/N)^(i−1) = m(1 − 2(1 − 1/m)^m) → m(1 − 2/e) ≈ 0.264m as m → ∞    (1)
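Equation (1) can be checked numerically. The sketch below assumes, as in the text, a per-frame detection probability of ns/N = 1/m:

```python
import math

def expected_frames(m):
    """E(F_d) from Equation (1): the detector samples a fraction 1/m of the
    image per frame, so frame i succeeds with prob (1/m)*(1 - 1/m)**(i-1)."""
    p = 1.0 / m
    return sum(i * p * (1 - p) ** (i - 1) for i in range(1, m + 1))

def closed_form(m):
    # the middle expression of Equation (1)
    return m * (1 - 2 * (1 - 1.0 / m) ** m)

# the asymptotic constant: 1 - 2/e, i.e. the 0.264m limit
limit_const = 1 - 2 / math.e
```

For large m the ratio closed_form(m)/m approaches 1 − 2/e ≈ 0.2642, matching the limit quoted in Equation (1).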

In the case of the perfect detector, and assuming uniform random placement of the object, there is no incentive to enforce any between-frame structure on the sampling strategy apart from disjoint sets of sample points. However if we consider the more realistic robust detector described above, then the proximity of the object to a sampling point becomes a factor.

As an example, suppose the minimal probability of detection for an object at any location in the image must be at least some value η≥0, and suppose that the robust detector has a detection probability function given by

P(δ) ≥ η if δ ≤ γ, and P(δ) = 0 otherwise    (2)

Then a square sampling grid with grid step γ√2 will provide the necessary coverage on a single image. Now, given a video stream with the conditions described in the perfect detector case (single object, uniformly distributed location), let us consider a four-frame stagger (i.e. m = 4) in which a quarter of the grid points are evaluated per frame. Figure 1 shows the sampling grid with the detection probability function represented by the disc around each sample point.

Figure 1: 2x2 Staggered Sampling Strategy
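The coverage claim for the γ√2 grid step can be verified numerically: on a square grid with step s, the farthest any point can lie from its nearest sample is the half-diagonal of a cell, s/√2, which equals γ exactly when s = γ√2. A small sketch (the function name and scan resolution are illustrative):

```python
import math

def worst_case_distance(step, resolution=200):
    """Scan one grid cell and return the largest distance from any point
    in the cell to the nearest of the four surrounding grid nodes."""
    worst = 0.0
    for i in range(resolution + 1):
        for j in range(resolution + 1):
            x = step * i / resolution
            y = step * j / resolution
            d = min(math.hypot(x - gx, y - gy)
                    for gx in (0.0, step) for gy in (0.0, step))
            worst = max(worst, d)
    return worst

gamma = 1.0
worst = worst_case_distance(gamma * math.sqrt(2))  # attained at the cell centre
```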

A particular sampling strategy corresponds to dividing the set of sample points into four equal sized groups. Clearly the probability of detection after four frames is equal no matter what strategy is employed, so the strategies are compared on the value of E(Fd), and this quantity is minimized by ensuring the probability of detection in the early stagger stages is as high as possible.

Let stagger step i be associated with a set of possible object locations Ai such that for each x ∈ Ai the distance from the sampling point di(x) ≤ γ. To simplify the argument, suppose further that each stagger step has only a single sampling point (i.e. ns = 1), and that a single object ω is set in the video stream at a random location xω. The probability of detecting the object within the first i stagger steps is then

( )( ) ( )( )���

� 1

1

1

1

1

)( −

=

=

=

−+=

=i

j ji

i

j ji

i

j ji

AAPAPAP

APP

(3)

where P(Ai) is the probability that xω ∈ Ai and that ω is detected within the first i stagger steps. This is maximized by minimizing the probability of the intersection in (3). Under the reasonable assumption that the detection probability on stagger step i is a monotone decreasing function of the distance di(xω), this is achieved by positioning the sampling point as far as possible from the i−1 previous sampling points. Thus an optimal sampling strategy is one in which the sampling points are selected such that the associated set of candidate locations Ai has the smallest possible overlap with the sets of candidate locations associated with previous sampling points.
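One simple way to realize this principle is a greedy farthest-first ordering of the cells of an m×m block, measuring distance on the torus so that the tiling of the pattern across the image is respected. This is an illustrative sketch only, not the procedure used to derive the matrices in (4), and it need not reproduce them exactly:

```python
import itertools
import math

def greedy_stagger_order(n):
    """Order the n x n cells so that each new sampling point is as far as
    possible from all previously chosen points (toroidal distance, since
    the block tiles the image). Returns an n x n matrix of step indices."""
    cells = list(itertools.product(range(n), range(n)))

    def dist(a, b):
        dx = min(abs(a[0] - b[0]), n - abs(a[0] - b[0]))
        dy = min(abs(a[1] - b[1]), n - abs(a[1] - b[1]))
        return math.hypot(dx, dy)

    chosen = [(0, 0)]                      # start at the top-left cell
    while len(chosen) < len(cells):
        best = max((c for c in cells if c not in chosen),
                   key=lambda c: min(dist(c, p) for p in chosen))
        chosen.append(best)
    order = [[0] * n for _ in range(n)]
    for step, (r, c) in enumerate(chosen):
        order[r][c] = step
    return order
```

For n = 2 this places step 1 diagonally opposite step 0, in line with the intuition above.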

Returning to the case described by Figure 1; if we refer to the black (plain pattern) discs as K, magenta (vertical line pattern) M, yellow (dotted pattern) Y and cyan (horizontal line pattern) C, and a sampling strategy is given by an ordering of these four point sets, then the eight optimal strategies are CKMY, CKYM, KCMY, KCYM, YMKC, YMCK, MYCK, MYKC. Equation (4) gives matrices for sampling orders for 2x2, 3x3, 4x4, 5x5 and 6x6 sampling.

[ 0 2 ]
[ 3 1 ]

[ 0 5 3 ]
[ 7 2 8 ]
[ 4 6 1 ]

[ 0 10  2  8 ]
[12  7 14  5 ]
[ 3 15  1 13 ]
[ 9  6 11  4 ]

[ 0 11 18  7  9 ]
[13 21  2 19  5 ]
[16  8 24 22 15 ]
[ 3 20 23  1 14 ]
[10  6 17 12  4 ]

[ 0 19 23  2 16 12 ]
[15  8 28 32 26  5 ]
[21 31 11 34  9 24 ]
[ 3 29 35  1 30 20 ]
[17 27 10 33  7 14 ]
[13  6 25 22 18  4 ]

(4)

Note that for stagger steps higher than 2x2, the optimal sampling order also requires that the proximity to sampling locations in neighboring sampling grids, as well as in previous sampling steps, be taken into account. For example, in the 5x5 sampling order above, considering the grid of a single sampling neighborhood in isolation would result in choosing the bottom-right position for sampling step 1. However, a sampling neighborhood to the bottom right would have already sampled its top-left location in step 0, thus the interior point identified in the 5x5 matrix in (4) is preferred.
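Given any such order matrix, generating the sample locations for a particular frame is straightforward: tile the matrix across the image grid and evaluate only the cells whose entry matches the current stagger step. A minimal sketch (the 2x2 order below is illustrative, with consecutive steps at diagonally opposite cells):

```python
def frame_samples(order, width, height, frame):
    """Return the (x, y) grid locations to evaluate in the given frame.
    `order` is an m x m stagger-order matrix with entries 0..m*m-1; the
    pattern is tiled over the width x height grid and cycles every m*m
    frames."""
    m = len(order)
    step = frame % (m * m)
    return [(x, y)
            for y in range(height)
            for x in range(width)
            if order[y % m][x % m] == step]

order_2x2 = [[0, 2],
             [3, 1]]
```

Over one full cycle of m·m frames every grid location is visited exactly once, which is the disjointness property the analysis above relies on.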

3.2. Sampling the Scales

So far we have described a method for staggering the sampling of each object scale in a video stream. However, most object detectors are designed to operate at multiple scales, where each scale is related to the previous scale by a scale factor. In many cases this scale factor is quite close to 1 (1.1 and 1.2 are common) to allow for the normal variance of object scales in natural images. The scale space may also be subject to staggered sampling, wherein neighboring scales are sampled in successive frames. For example, scale space sampling with factor 2 samples scales 0, 2, 4, … in the even frames, and the alternate scales are sampled in the odd frames. In a similar spirit to the spatial sampling grid above, this exploits the robustness of a detector to maintain reasonably high detection rates while sampling considerably fewer locations.


Figure 2: Two methods of implementing 2x2x2 staggering - the numbers indicate the frame index at which the location is sampled.

If scale space sampling is implemented together with spatial grid sampling, then allowance should be made to conclude the sampling of the spatial grid patterns together with the scaling steps. This can be done in two ways: either by (a) interleaving the scale space sampling within a single pattern of the spatial grid sampling, or alternatively by (b) interleaving a full spatial grid pattern within the scale space sampling. An example of the two methods is given for 2x2x2 sampling in Figure 2.

In our experiments the scale stagger interleaving the spatial stagger (strategy (a)) performed slightly better; however, it is likely that this would change with the robustness characteristics of the particular detector being used. In an ideal case the scale factor and spatial grid resolution are chosen to match the detector characteristics exactly, so that increasing any of the x, y or scale steps by some factor f will result in a similar deterioration in the final performance. In practice the sampling grids are not always matched so well to the detector; for example, a detector that is particularly robust to scale changes might benefit more from strategy (b) than strategy (a).
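The two interleavings can be written down directly. In the sketch below a frame's work is identified by a (spatial phase, scale group) pair; strategy (a) cycles the scale groups inside each spatial pattern, while strategy (b) runs a full spatial pattern inside each scale group. The function and naming are illustrative, and the exact frame numbering in Figure 2 may differ:

```python
from itertools import product

def interleave_schedule(spatial_phases, scale_groups, strategy):
    """One full cycle of (spatial phase, scale group) assignments, one per
    frame."""
    if strategy == "a":
        # scale stagger nested inside the spatial stagger
        return [(sp, sc) for sp, sc in product(range(spatial_phases),
                                               range(scale_groups))]
    # strategy (b): spatial stagger nested inside the scale stagger
    return [(sp, sc) for sc, sp in product(range(scale_groups),
                                           range(spatial_phases))]

# 2x2x2 staggering: 4 spatial phases, 2 scale groups, an 8-frame cycle
sched_a = interleave_schedule(4, 2, "a")
sched_b = interleave_schedule(4, 2, "b")
```

Both schedules cover the same 8 (phase, group) combinations per cycle; only the order, and hence how quickly each scale band is revisited, differs.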

3.3. Object Tracking

The key to successfully implementing a staggered sampling technique is to enhance the search strategy in future frames by leveraging past detection results, which amounts to a tracking strategy. In the simplest form this might be a “one-frame memory”, in which a more intense (1x1x1) sampling is undertaken in the immediate locality of objects found in the previous frame. This approach has the advantage of being very simple to implement, and tends to give very good results when the object detector has a high recall rate (i.e. when a location containing an object is sampled, the probability of returning a “false” is very low), and when objects are rarely occluded in the video streams under consideration.

For the results here we use a more sophisticated tracking implementation in which a memory of “lost” tracks is retained and intensive searching continues both around current and lost tracks. Such a method requires only an incremental increase in infrastructure and computational cost, but yields a sampling strategy that is much more robust to occlusions and detection failures. Pseudo code of this algorithm is supplied as supplemental material.
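The pseudo code for that tracker is supplied as supplemental material; the sketch below is only a minimal illustration of the idea of retaining lost tracks, with hypothetical parameters `max_lost` (frames a lost track is remembered) and `radius` (the match window):

```python
def update_tracks(tracks, detections, max_lost=10, radius=20):
    """One frame of a minimal tracker with a memory of lost tracks.
    Each track is a dict with 'pos' (x, y) and 'lost' (frames since the
    last verified detection). Detections within `radius` refresh a track,
    unmatched detections open new tracks, and tracks unseen for more than
    `max_lost` frames are dropped. Intensive (1x1x1) re-sampling would be
    scheduled around every surviving track, current or lost.
    (Illustrative only; not the author's supplemental pseudo code.)"""
    unmatched = list(detections)
    for t in tracks:
        near = [d for d in unmatched
                if abs(d[0] - t['pos'][0]) <= radius
                and abs(d[1] - t['pos'][1]) <= radius]
        if near:
            t['pos'], t['lost'] = near[0], 0
            unmatched.remove(near[0])
        else:
            t['lost'] += 1          # keep the track alive while it is lost
    tracks = tracks + [{'pos': d, 'lost': 0} for d in unmatched]
    return [t for t in tracks if t['lost'] <= max_lost]
```

Because a lost track survives for `max_lost` frames, a brief occlusion or detector miss does not end the intensive local search, which is what makes the staggered background scan robust.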

The particular tracking methodology used is not central; any reasonable tracking algorithm can be used once objects are detected. On the other hand, inadequate tracking will normally result in a significant deterioration in recall rates when staggered sampling is applied. In fact, if no degradation is observed, then we may conclude that the original (dense) sampling grid has been set at too fine a step for the inherent robustness of the detector and should be coarsened.

4. Results

We tested the staggered sampling approach using two implementations. The first is an open source face detector, which we applied to some well-known video sequences. In the second case we applied our own custom face detector to a more difficult sequence extracted from a movie trailer.


4.1. OpenCV Face Detector & QCIF Sequences

We first tested the staggered sampling approach using the OpenCV [12] upright face detection module (haarcascade_frontalface_alt2) as an object detector. We implemented staggered sampling via a modified version of cvHaarDetectObjects, the OpenCV sampling routine. Some heuristics in the original code to reduce the false positive count were also removed, but the original “grouping” component, which combines multiple overlapping detections, was left in. In the final list of detection locations for each frame, a false positive is marked when (1) the location doesn’t overlap a face in the frame at all; (2) the location overlaps a face in such a way that some of the main facial features (eyes, nose, mouth) are not contained within the location (this penalizes nonsensical overlaps); or (3) the location overlaps a whole face but is more than twice the area of the face box defined by the chin, jaw line and forehead (this penalizes detections in which the face is unreasonably small). The dense, naïve and staggered sampling tests all use the same object tracking algorithm as described in Section 3.3. Dense sampling is used as a benchmark for comparison with the other sampling methods.

We used four video sequences (described below) from the QCIF test sequence set to test the effectiveness of our methods. In each case the sequences contained at most one face per frame, and exactly one contiguous subsequence of frames containing all the face images. A face is considered to be detected in a given frame only if the face detector verifies it (i.e. a verified track). False positives were identified using frame-by-frame visual inspection of the detection results.

miss_am

This sequence has a single frontal, upright face in each frame. There is some translation of the face throughout the sequence, but tracking is easily maintained. The benchmark sampling detects all instances of the face as well as around 5 false positives.

suzie

This sequence has a single face in each frame, but with rapid and significant change of pose and location at some points in the sequence. The face is not detectable by the benchmark sampling in about 20% of the frames.

foreman This sequence has a single face in about 75% of the frames. The faces have some unusual view angles and rapid rotations, but very little translation.

carphone This sequence has a single face in each frame, but with quite a bit of in-plane rotation and some unusual expressions.

Clearly we don’t expect a staggered sampling strategy to give better results (at least in terms of recall) than dense sampling, so we present the results in a relative format, using the ratio between recall rates rather than the absolute values. Figures 4–7 show the results on the four video sequences. In each case the x-axis is the proportion of sampling points included in the sampling grid, and for clarity a log scale is used. For example, a 1x1x2 sampling has 0.5 of the dense grid, 2x2x3 has 1/12th and so on.
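The proportions quoted here follow directly from the step sizes; a one-line sketch:

```python
def sample_proportion(x_step, y_step, scale_step):
    """Fraction of the dense (1x1x1) grid evaluated per frame for an
    x_step x y_step x scale_step sampling."""
    return 1.0 / (x_step * y_step * scale_step)
```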

Figure 3: Test sequences (from top-left), miss_am, suzie, foreman, carphone

Note that the spatial stagger matrices in Equation (4) could equally be implemented in any of the four 90° rotations of the matrices, so to collect our results we ran staggered and naïve sampling using all four rotations (for naïve sampling the position of the 0 is the only position sampled) and recorded the best, worst and average results.

[Plot: Recall(Naive)/Recall(Staggered) vs. proportion of sample points used (log scale), for the suzie, carphone, miss_am and foreman sequences.]

Figure 4: Ratio of the worst recall results for naïve sampling to the worst recall results for staggered sampling

Figure 4 compares staggered sampling with a naïve sampling achieved by simply coarsening the sampling grid. In this case it is appropriate to compare the worst results of naïve sampling with the worst results of staggered sampling, as the conjecture is that staggered sampling reduces the catastrophic failures which may occur using naïve sampling. This is borne out in Figure 4, where we can clearly see a very rapid deterioration of the naïve sampling recall rates compared to the staggered sampling recall rates as the proportion of sample points drops below around 5%.

[Plot: Recall(Naive)/Recall(Staggered) vs. proportion of sample points used (log scale), for the suzie, carphone, miss_am and foreman sequences.]

Figure 5: Ratio of the average recall results for naïve sampling to the average recall results for staggered sampling

For the purposes of comparison, Figure 5 gives the same plot as Figure 4, but this time comparing the ratio of the average recall results for naïve and staggered sampling. A similar curve to that of Figure 4 is observed, but with a rather less pronounced dropoff at small sample-point proportions.

[Plot: Recall(Staggered)/Recall(Dense) vs. proportion of sample points used (log scale), for the suzie, carphone, miss_am and foreman sequences.]

Figure 6: Ratio between the average staggered sampling recall rates and the dense sampling recall rate

Figure 6 compares the average recall rates using staggered sampling with the recall rate for each sequence using dense sampling. Even in the case of the very difficult “suzie” sequence the deterioration is fairly graceful, reaching almost 80% of the original recall at the lowest point. This suggests that even with quite aggressive decimation of the sampling grid we are able to produce reasonable results in terms of recall.

Although the staggered sampling cannot be expected to improve on dense sampling in terms of recall, it can be expected to improve precision. Indeed, any sampling scheme (including naïve sampling) that reduces the total number of locations sampled will reduce the false positive rate and therefore increase precision. Figure 7 demonstrates the increase in precision as the number of points sampled decreases.

[Plot: Precision(Staggered) vs. proportion of sample points used (log scale), for the suzie, carphone, miss_am and foreman sequences.]

Figure 7: Average staggered sampling precision results

4.2. Custom Face Detector & Complex Sequence

For a more complex video sequence we cropped a 255 frame scene from a movie trailer. The scene has 8 changes of camera view during which tracks will be lost, it has a very complex background and several occlusion events. Out of the 255 frames, 22 contain no detectable faces, 121 contain a single face, 69 contain 2 faces and 43 contain 3 faces.

Figure 8: Frame from the complex video sequence

When we ran the OpenCV detector on this sequence we found that the precision / recall characteristics of the detector made it extremely difficult to distinguish signal from noise, so we performed a separate analysis using our own custom face detector.

The results from this experiment are shown in Figure 9, where for clarity we have plotted the x-axis on a log scale. The top-right point corresponds to the dense sampling result. Clearly in this case staggered sampling again offers a significant improvement over the naïve implementation, especially as the proportion of sample points used becomes very small.

[Plot: recall rate vs. proportion of sample points used (log scale), staggered vs. naïve sampling.]

Figure 9: Recall rate vs. sampling density (log scale) on complex sequence

5. Conclusions

We have presented a method of staggering the sampling grid in video object detection that requires very little extra infrastructure or computational overhead. This method can be used in any video object detection framework that implements a sampling grid, together with any tracking method to retain detections from previous frames. We have shown that staggered sampling offers a graceful and contained degradation of recall rates while omitting up to 95% of the original sampling points, and improves precision. Staggered sampling may be used in resource-constrained detection and tracking environments to maintain a comparatively high quality background scan of a scene for the appearance of new objects, yet work within variable and aggressive system constraints.

Acknowledgements

The author would like to thank the referees for their comments that have improved the presentation and results of this paper.

References

[1] Viola, P.; Jones, M., "Rapid object detection using a boosted cascade of simple features," Computer Vision and Pattern Recognition, pp. I-511—I-518, 2001.

[2] Rowley, H.A.; Baluja, S.; Kanade, T., "Neural network-based face detection," Computer Vision and Pattern Recognition, 1996. pp. 203—208, 1996.

[3] Li, S.Z.; Zhenqiu Zhang, "FloatBoost learning and statistical face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), pp. 1112—1123, 2004.

[4] Schneiderman, H.; Kanade, T., "A statistical method for 3D object detection applied to faces and cars," Computer Vision and Pattern Recognition, 2000, pp.746—751 vol.1, 2000.

[5] Papageorgiou, C.; Poggio, T., “A Trainable System for Object Detection”, International Journal of Computer Vision 38(1), pp. 15—33, 2000.

[6] Froba, B.; Kublbeck, C., "Robust face detection at video frame rate based on edge orientation features," Automatic Face and Gesture Recognition, 2002, pp. 342—347, 2002.

[7] Verma, R.C.; Schmid, C.; Mikolajczyk, K., “Face detection and tracking in a video by propagating detection probabilities,” IEEE Transactions on Pattern Analysis and Machine Intelligence 25(10), pp. 1215—1228, 2003.

[8] Zhengrong Yao; Haibo Li, “Tracking a Detected Face with Dynamic Programming,” Image and Vision Computing. 24(6), pp. 573—580, 2006.

[9] Czyz, J., “Object Detection in Video via Particle Filters,” 18th International Conference on Pattern Recognition, 2006, pp.820—823, 2006.

[10] Wang, J.; Bebis, G.; Nicolescu, M.; Nicolescu, M.; Miller, R., “Improving target detection by coupling it with tracking”, Machine Vision and Applications, 20(4), pp. 205—223, 2009.

[11] Comaniciu, D.; Ramesh, V.; Meer, P., “Kernel-based object tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5), pp. 564—577, 2003.

[12] Intel’s OpenCV library, now hosted at SourceForge, http://sourceforge.net/projects/opencvlibrary/