
1242 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 7, NO. 9, SEPTEMBER 1998

A Harmonic Retrieval Framework for Discontinuous Motion Estimation

Wei-Ge Chen, Member, IEEE, Georgios B. Giannakis, Fellow, IEEE, and N. Nandhakumar, Senior Member, IEEE

Abstract—Motion discontinuities arise when there are occlusions or multiple moving objects in the scene that is imaged. Conventional regularization techniques use smoothness constraints but are not applicable to motion discontinuities. In this paper, we show that discontinuous (or multiple) motion estimation can be viewed as a multicomponent harmonic retrieval problem. From this viewpoint, a number of established techniques for harmonic retrieval can be applied to solve the challenging problem of discontinuous (or multiple) motion. Compared with existing techniques, the resulting algorithm is not iterative, which not only implies computational efficiency but also obviates concerns regarding convergence or local minima. It also adds flexibility to spatio-temporal techniques, which have suffered from a lack of explicit modeling of discontinuous motion. Experimental verification of our framework on both synthetic and real image data is provided.

Index Terms—Compression, computer vision, discontinuous motion, harmonic retrieval, motion estimation, multimedia, multiple motion, video communication.

I. INTRODUCTION

MOTION estimation plays an important role in computer vision as well as in video communications, which has become increasingly important due to the rapid growth of multimedia applications [11]. In computer vision, two-dimensional (2-D) image motion estimation is useful for reconstructing three-dimensional motion or scene structure [2]. In video communications, image motion estimation is mainly used for interframe video compression [21, ch. 10]. Accurate estimation of image motion is important for high compression ratios because it facilitates reduction of temporal dependency among video frames. Due to its importance, motion computation has been studied extensively and many different methods have been proposed with various degrees of success (see, e.g., [5]).

Manuscript received January 20, 1995; revised February 20, 1997. The work of W.-G. Chen and N. Nandhakumar was supported by the National Science Foundation under Grant IRI-91109584. The work of G. B. Giannakis was supported by the National Science Foundation under Grant MIP-9210230 and by the Office of Naval Research under Grant N00014-93-1-0485. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Eric Dubois.

W.-G. Chen is with Microsoft Corporation, Redmond, WA 98052 USA (e-mail: [email protected]).

G. B. Giannakis is with the Department of Electrical Engineering, University of Virginia, Charlottesville, VA 22903-2442 USA (e-mail: [email protected]).

N. Nandhakumar is with the LG Electronics Research Center of America, Inc., Princeton, NJ 08550 USA (e-mail: nandhu@lgerca.com).

Publisher Item Identifier S 1057-7149(98)06387-8.

This paper introduces a new framework for computing motion in the presence of motion discontinuities. Motion discontinuities appear as spatial discontinuities of motion fields and arise frequently when there are occlusions or multiple moving objects in the scene that is being imaged. In fact, multiple moving objects can be thought of as a special case of occlusions in which the occluding surfaces are in motion. Since motion discontinuities occur when the motion field contains multiple (at least two) regions of piecewise smooth motion, discontinuous motion can also be thought of as multiple motion. Techniques for discontinuous motion will be equally successful for multiple motion, and vice versa.

An inherent obstacle in motion computation is the aperture problem, which manifests itself in nonunique local motion estimates and renders motion computation an ill-posed problem [17, ch. 12]. Usually, a unique estimate of the global motion field can be obtained through regularization, which essentially resorts to a stabilizing term that imposes a smoothness constraint on the global motion field [18]. It is rather obvious that the smoothness constraint should only be imposed within regions where motion is smooth and should not be applied across motion discontinuities. However, motion discontinuities are usually not known prior to motion computation—the notable “chicken-and-egg” problem [17, ch. 12].

Many approaches have been attempted to avoid erroneous smoothing across motion discontinuities, e.g., [4], [6], [15], [16], [20], [23], [25], [28, p. 77], [30]–[33]. Among the most notable and successful ones are the Markov random field (MRF) based approaches [15], [20], [30]–[32]. No closed-form solution is available with the MRF-based approaches; instead, an iterative algorithm is used to optimize a highly nonlinear and nonconvex function.

Approaches for motion estimation can be categorized into two groups: i) techniques which exploit feature/correlation matching or the differential optical flow constraint in the spatial domain of images, and ii) spatio-temporal techniques that operate in the spatio-temporal frequency domain. Most of the methods developed to deal with motion discontinuities are restricted to the first group. Much less study of discontinuous motion exists with respect to the spatio-temporal techniques. Many different spatio-temporal formulations have assumed that motion is constant within the spatio-temporal window, and hence motion discontinuities are ignored [1], [9], [12], [14], [26]. Despite their favorable experimental performance [5], the lack of explicit models for discontinuous motion hampers

1057–7149/98$10.00 © 1998 IEEE


Fig. 1. Time-varying partition of the image plane. (a)–(c) Time-varying partition sets for a circular entity and a stationary background at t = 0, 1, 2. The shape of W_2(t) does not change but its position changes. However, the shape of W_1(t) is “deformed” over time due to occlusion from the circular entity.

application of spatio-temporal techniques to the estimation of complex motion fields where multiple motion is common.

This paper provides a new viewpoint on computing discontinuous (or multiple) motion within the spatio-temporal class. We show that when each smooth piece of the motion field is sufficiently smooth, multiple motion estimation amounts to a multicomponent harmonic retrieval problem. This viewpoint allows us to exploit many mature results from a century of research on frequency estimation (see, e.g., [24, ch. 10]). Our approach is unique in that

• the resulting algorithm is not necessarily iterative, a clear computational advantage, which also avoids convergence problems;

• our spatio-temporal solution explicitly models and estimates discontinuous motion, and thus adds flexibility to spatio-temporal techniques;

• velocity estimation is achieved regardless of, and therefore is not affected by, the spatial distribution (shape) of the moving image region and the “density” of motion discontinuities;

• neurophysiological evidence suggests that the human vision system may use the spatio-temporal approach to compute motion [3]; in this regard, our formulation may serve as a candidate model for discontinuous motion processing in the human vision system.

In the following section, we first establish an explicit model for discontinuous motion and demonstrate the equivalence of discontinuous motion estimation to multicomponent harmonic retrieval. Section III deals with the adaptation of different frequency estimation techniques to this problem and discusses other implementation details. Finally, in Section IV, we provide experimental results of applying our technique to both synthetic and real image sequences.

II. MODELING MULTIPLE MOTION

Consider a partition {W_l, l = 1, ..., N} of the image region W (which could be the whole image plane or a window of interest on the image plane) such that

W_l ∩ W_n = Ø for l ≠ n, and W_1 ∪ W_2 ∪ ··· ∪ W_N = W. (1)

Fig. 1(a) gives an illustrative example of such a partition. Conditions in (1) guarantee that members of the partition are disjoint and that they fully cover the image region.

Let v(x, y) = (u(x, y), v(x, y)) be the image motion (velocity) vector field on W, where x and y are the continuous horizontal and vertical coordinates of W, respectively. We are interested in a particular partition {W_l, l = 1, ..., N} satisfying the following conditions.

(C1) On each W_l, the image motion field is continuous with respect to x and y, l = 1, ..., N.

(C2) On [W_l] ∩ [W_n],¹ for every pair (l, n) satisfying l ≠ n and [W_l] ∩ [W_n] ≠ Ø, the motion field is discontinuous with respect to x and y at every point of [W_l] ∩ [W_n].

Conditions (C1) and (C2) state that a valid partition decomposes the image plane into regions within each of which motion is continuous with respect to x and y (C1), but between each adjacent pair of which motion is discontinuous with respect to x and y (C2). We name each member of the partition, W_l, a smooth motion region. The boundaries of each smooth motion region are referred to as motion discontinuities or motion boundaries.
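The partition conditions in (1) are straightforward to check mechanically on discretized masks. The following sketch is illustrative only: the grid size, the circular foreground mimicking Fig. 1, and all variable names are our own choices rather than anything specified in the paper.

```python
import numpy as np

# Hypothetical 8x8 image region split into two binary masks:
# a "circular" foreground entity and its complement, as in Fig. 1.
H, W = 8, 8
yy, xx = np.mgrid[0:H, 0:W]
w2 = ((xx - 4) ** 2 + (yy - 4) ** 2 <= 4).astype(int)  # circular entity
w1 = 1 - w2                                            # stationary background

# Conditions in (1): members are pairwise disjoint and cover the region.
disjoint = np.all(w1 * w2 == 0)
covering = np.all(w1 + w2 == 1)
print(disjoint, covering)  # -> True True
```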

Each smooth motion region usually corresponds to a physical entity (object) O_l whose image appears on W_l. In a time-varying image sequence, a smooth motion region, W_l, may change with time due to the motion of O_l and the occlusion among the objects. We write it as W_l(t) to explicitly denote such a time dependency. Note that the number of sets in the partition may be time-varying and should be denoted by N(t). However, for brevity, we maintain the constancy of N and let N(t) = N. An example can be found in Fig. 1(a)–(c), where a striped circular entity [W_2(t)] undergoes motion from left to right and the gray background [W_1(t)] remains stationary.

A. Problem Definition

We collect a sequence of T images at discrete time instants t = 0, 1, ..., T − 1. Note that usually the images are sampled spatially, so that x and y are discrete. However, in the following discussion we retain x, y as continuous for notational simplicity. Given the time-varying image sequence, finding N, W_l(t), and the motion field on each W_l(t) for all l constitutes the general multiple motion estimation problem. We note that when N = 1, no motion discontinuity is present, and thus the image motion field is smooth throughout the image region W. Techniques involving unconditional global smoothness

¹ [W_l] and [W_n] denote the closures of W_l and W_n, respectively.


constraints may be effective in this case. But when N > 1, unconditional smoothness constraints are not appropriate, and erroneous velocity estimates will result, especially at motion boundaries.

We assume that for a short time interval motion is time-invariant. In addition, as commonly assumed in the motion estimation literature [5], we approximate the slow spatial variations within each smooth motion region by constant motion. In our terms, we restrict the motion field on each W_l(t) to be constant, according to the following assumption:

(A1) v(x, y) = (u_l, v_l) for all (x, y) ∈ W_l(t), l = 1, ..., N.

This assumption is reasonable when the variation in depth is relatively small within each W_l compared to the viewing distance. In particular, (A1) holds when the image region W is locally defined. Note that we allow disconnected constant motion regions that have the same motion to be considered as one single constant motion region. As a consequence, we assume

(A2) (u_l, v_l) ≠ (u_n, v_n) for l ≠ n.

Under assumptions (A1) and (A2), we wish to find N, W_l(t), and (u_l, v_l), l = 1, ..., N, given the time-varying image sequence f(x, y, t).

B. An Explicit Model for Multiple Motion

Let f(x, y, t) denote the observed image with support region W, and let f_l(x, y, t) be what we will call the “ideal” image associated with physical entity O_l at time t. By “ideal,” we mean that it would be the image of O_l if nothing occluded O_l. The observed image of physical entity O_l at time t is w_l(x, y, t) f_l(x, y, t), with support region W_l(t). Occasionally, w_l(x, y, t) f_l(x, y, t) is identical to f_l(x, y, t). But more frequently, they are not identical, and f_l(x, y, t) becomes an imaginary quantity because it cannot be observed in full. One should not be too concerned with the existence of f_l(x, y, t), because we introduce the notation merely for the convenience of illustration.

Motion of O_l induces motion of the support region of O_l. When motion is mostly translational and perspective effects of the imaging system are negligible, the “ideal” image at time t is the translated version of f_l(x, y, 0), namely f_l(x, y, t) = f_l(x − u_l t, y − v_l t, 0). Furthermore, the actual “observed” image of O_l is the translated “ideal” image with the new support region W_l(t). Thus, we denote the “observed” image by w_l(x, y, t) f_l(x, y, t), where w_l(x, y, t) stands for the indicator (window) function of W_l(t), i.e.,²

w_l(x, y, t) = 1 if (x, y) ∈ W_l(t), and 0 elsewhere.

² W_l(t) is a set containing image points while w_l(x, y, t) is a binary-valued function defined on the set W_l(t). Distinguishing W_l(t) and w_l(x, y, t) is necessary for the rest of the discussion.

Fig. 2(a)–(c) depicts w_1(x, y, t) for the partition defined in Fig. 1. Due to the definition of w_l(x, y, t) and the fact that the W_l(t)’s do not overlap, the observed image can be written as the “superposition” of the ideal images of all entities restricted by their corresponding windows:

f(x, y, t) = Σ_{l=1}^{N} w_l(x, y, t) f_l(x, y, t). (2)

The estimation of (u_l, v_l) would be easier if W_l(t) were not deformable with time, which in fact means that no occlusion (or deocclusion) occurred during the T frames. In the following, we decompose w_l(x, y, t) into a “constant” portion and a “deformable” portion. The constant portion of w_l(x, y, t) that is not deformable with time is useful for motion estimation, while the deformable portion of w_l(x, y, t) is an error source because it introduces noise due to “the appearance of new” or “the disappearance of old” pixels.

Consider registering all w_l(x, y, t) to w_l(x, y, 0) by shifting the window function according to the motion, and define an “average window” function as

w_{l0}(x, y) = (1/T) Σ_{t=0}^{T−1} w_l(x + u_l t, y + v_l t, t). (3)

By shifting w_{l0}(x, y) to the current time, the average window function w_{l0}(x − u_l t, y − v_l t) describes the part of w_l(x, y, t) that has a constant support “shape.” Next consider the rest of w_l(x, y, t),

d_l(x, y, t) = w_l(x, y, t) − w_{l0}(x − u_l t, y − v_l t), (4)

the “window difference,” which only accounts for the deformation of W_l(t) over time [e.g., Fig. 2(e)–(g)]. Substituting (3) and (4) into (2) yields

f(x, y, t) = Σ_{l=1}^{N} [w_{l0}(x − u_l t, y − v_l t) + d_l(x, y, t)] f_l(x, y, t) (5)

= Σ_{l=1}^{N} s_l(x − u_l t, y − v_l t) + e(x, y, t) (6)

where

s_l(x, y) = w_{l0}(x, y) f_l(x, y, 0), (7)

the “virtual” image of entity O_l, is generally not directly observed but artificially defined to be the nondeformable part of the observed image of O_l that appears simply shifted from frame to frame. It is effectively the signal component from which motion estimates are going to be extracted. The error term

e(x, y, t) = Σ_{l=1}^{N} d_l(x, y, t) f_l(x, y, t) (8)

corresponds to the part of f(x, y, t) that cannot be described by the shift operation and is considered as the noise component in our model.
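To make the decomposition concrete, here is a minimal 1-D numerical sketch. It assumes, as one plausible reading of (3), that the average window is the temporal mean of the motion-registered windows; the window sizes, the occluder position, and the velocity are all hypothetical choices of ours, not the paper's setup.

```python
import numpy as np

T, L = 4, 16
u = 1                                    # hypothetical velocity, px/frame
w = np.zeros((T, L))
for t in range(T):
    w[t, 2 + u * t: 8 + u * t] = 1.0     # support translates by u each frame
    w[t, 7:] = 0.0                       # a static occluder hides x >= 7

# Register every frame back to t = 0 by undoing the shift; the temporal
# mean of the registered windows plays the role of w_l0 in (3).
registered = np.array([np.roll(w[t], -u * t) for t in range(T)])
w_avg = registered.mean(axis=0)

# Window difference (4): what the shifted average window fails to explain.
d = np.array([w[t] - np.roll(w_avg, u * t) for t in range(T)])
```

Because the occluder progressively eats into the support, the window difference d is nonzero; for an unoccluded, purely translating window it would vanish, as for the circular entity in Fig. 2.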


Fig. 2. Window functions, average windows, and window differences for the partitions in Fig. 1. (a)–(c) Window function w_1(x, y, t), t = 0, 1, 2. (d) w_{10}(x − u_1 t, y − v_1 t), the average window of w_1(x, y, t). Only one instance is shown because u_1 = v_1 = 0 (the shift is zero) and w_{10}(x − u_1 t, y − v_1 t) is the same for all t. (e)–(g) Window difference, d_1(x, y, t), t = 0, 1, 2, of the w_1(x, y, t) shown in (a)–(c). d_1(x, y, t) is deformed over time and cannot be described by a shift operation. w_2(x, y, t) and related figures are not shown since the circular entity is not occluded by anything; thus the average window, w_{20}(x − u_2 t, y − v_2 t), coincides with w_2(x, y, t), and the window differences, d_2(x, y, t), are zero everywhere.

III. ESTIMATING MULTIPLE MOTION

In this section, we first demonstrate that, based on (6), multiple motion estimation is equivalent to a multicomponent harmonic retrieval problem. Afterward, we adapt two harmonic retrieval techniques so that they become most suited to the task of motion estimation. Then, we address some issues that are unique to motion estimation.

A. Multiple Motion Estimation as Frequency Estimation

Taking the 2-D spatial Fourier transform of both sides of (6) yields

F(ω_x, ω_y; t) = Σ_{l=1}^{N} S_l(ω_x, ω_y) e^{−j(ω_x u_l + ω_y v_l) t} + E(ω_x, ω_y; t). (9)

For fixed (ω_x, ω_y), signal F(ω_x, ω_y; t), viewed as a time series, is a sum of harmonics in noise, the l-th harmonic having frequency

ω_l = −(ω_x u_l + ω_y v_l). (10)

Estimating velocities (u_l, v_l) amounts to estimating the frequencies of the harmonics in (10). For each of the N velocities, the corresponding harmonic signal component has amplitude square |S_l(ω_x, ω_y)|², while the noise component has time-averaged amplitude square (1/T) Σ_{t=0}^{T−1} |E(ω_x, ω_y; t)|². Intuitively speaking, for better estimation of ω_l, we prefer the amplitude of the signal component to be larger than that of the noise component. Thus, it is natural to define the signal-to-noise ratio (SNR) as

SNR = |S_l(ω_x, ω_y)|² / [(1/T) Σ_{t=0}^{T−1} |E(ω_x, ω_y; t)|²] (11)

and use it as a measure or indicator for reliable motion estimation.
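As a toy numerical illustration of this SNR, consider a single complex harmonic observed in additive noise over T frames; the amplitude, noise level, frequency, and seed below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 64
t = np.arange(T)
omega = 2 * np.pi * 10 / T                # true temporal frequency (bin 10)
signal = 2.0 * np.exp(1j * omega * t)     # harmonic "signal component"
noise = 0.5 * (rng.standard_normal(T) + 1j * rng.standard_normal(T))

# SNR as in (11): signal amplitude square over time-averaged noise power.
snr = abs(2.0) ** 2 / np.mean(np.abs(noise) ** 2)

# At this SNR the periodogram peak still lands on the true frequency bin.
peak = int(np.argmax(np.abs(np.fft.fft(signal + noise)) ** 2))
print(peak)  # -> 10
```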


Recalling the definitions of s_l(x, y) and e(x, y, t) in (7) and (8), we infer that the SNR in (11) primarily depends on w_l(x, y, t) and the amount of occlusion during the T frames. However, without specific knowledge of the occlusion, we can only hope to characterize the SNR by adopting a stochastic formulation. Let the ideal image f_l(x, y, t) be a stationary random field characterized by its power spectral density P_l(ω_x, ω_y). Using a property of the periodogram [7, Th. 4.4.2, p. 95] and taking into account (7) and (8), the expected values (E{·}) of the amplitude square of the signal and noise components in (9) are found to be

E{|S_l(ω_x, ω_y)|²} = P_l(ω_x, ω_y) ∫∫ w_{l0}²(x, y) dx dy (12)

and

E{(1/T) Σ_{t=0}^{T−1} |E(ω_x, ω_y; t)|²} = P_l(ω_x, ω_y) (1/T) Σ_{t=0}^{T−1} ∫∫ d_l²(x, y, t) dx dy (13)

where in deriving (13) we also have assumed that f_l is independent of f_n for l ≠ n. In light of (12) and (13), the stochastic definition of (11) is given by

SNR = ∫∫ w_{l0}²(x, y) dx dy / [(1/T) Σ_{t=0}^{T−1} ∫∫ d_l²(x, y, t) dx dy] (14)

which is intuitively appealing. Indeed, in (14), if the window difference d_l(x, y, t) is “small” compared to the average window w_{l0}(x, y), the SNR is high; but if the window difference is large, the SNR is low. An extreme case occurs when there is no occlusion for a particular physical entity: the corresponding window does not “deform” over time; hence, the window difference d_l(x, y, t) is zero everywhere. We then have infinite SNR [see, e.g., W_2(t) in Fig. 1(a)–(c)]. Again, without specific knowledge of occlusion, it is not possible to further analyze the SNR. In the following, we select three examples that may often occur in practical imaging situations and numerically analyze the effect of occlusion on the SNR in (14) for a particular physical entity O_l. What these typical cases reveal can be useful guidelines for the application of our formulation.

Case 1—Linearly Occluding Object: We assume that W_l(0) is a rectangle and that the area of W_l(t) shrinks linearly with t. In particular, we assume that the shrinking occurs only in the x direction, i.e., the vertical extent does not change but the horizontal extent shrinks by u pixels per frame. This is the typical case when an object is gradually moving behind an occluding surface and thus the visible part is gradually decreasing, for example, when a car is starting to be occluded by a building or when a car is moving out of the field of view. Equivalently, the results of this case apply to a linearly enlarging W_l(t), which happens when the object is coming out of the occluding surface, for example, a car emerging from behind a building or from outside of the field of view. The SNR in (14) is evaluated

Fig. 3. The SNR as a function of the velocity of the occluded physical entity (Case 1). The SNR drops as u increases. Note also that when u = 10 pixels/frame, the physical entity O_l has disappeared from the view completely at t = 10.

numerically for a rectangle of fixed size and plotted in Fig. 3 against the velocity u in pixels/frame for T = 10 frames. Note that the SNR drops as u increases, which makes sense since the amount of occlusion increases as well. When u = 0, the SNR is at infinity. Note also that when u = 10 pixels/frame, the physical entity, e.g., the car, has disappeared from the view completely at the end of the image sequence. For all u values, the SNR is high enough for satisfactory performance of harmonic retrieval algorithms. A general guideline borne out of our experience with synthetic and real data is that when physical entities become invisible (complete occlusion) in more than 10% of the frames at the end of the sequence, motion estimates may become unreliable.
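The shrinking-support behavior of Case 1 can be mimicked numerically in one dimension. The sketch below uses a 1-D object, a static occluder, and an energy-ratio surrogate for (14) (average-window energy over time-averaged window-difference energy); all sizes, the registration by temporal mean, and the surrogate itself are our assumptions, not the paper's exact setup.

```python
import numpy as np

def snr_case1(u, T=10, L=128):
    # A 1-D object of width 32 translates at u px/frame; a static
    # occluding surface hides everything at x >= 40 (hypothetical geometry).
    w = np.zeros((T, L))
    for t in range(T):
        start = 8 + u * t
        w[t, start:start + 32] = 1.0
        w[t, 40:] = 0.0                      # static occluder
    # Register windows to t = 0 and take their temporal mean (avg window).
    reg = np.array([np.roll(w[t], -u * t) for t in range(T)])
    w_avg = reg.mean(axis=0)
    d = reg - w_avg                          # deformable remainder
    return (w_avg ** 2).sum() / ((d ** 2).sum() / T + 1e-12)

snrs = [snr_case1(u) for u in (0, 1, 2, 3)]
print(snrs[0] > snrs[1] > snrs[2] > snrs[3])  # -> True: SNR drops as u grows
```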

Case 2—Moving Foreground over Stationary Background: Let W_l(t) be a rectangle whose size and position do not change. Let W_n(t) be a rectangle whose size does not change but whose horizontal position shifts by u pixels per frame. In this case, we consider the SNR for the background region W_l(t). This is the typical case where the stationary background is occluded by a moving foreground object. The SNR in (14) is evaluated numerically and plotted in Fig. 4 against u in pixels/frame for T = 10 frames. In Fig. 4, the SNR drops initially as u increases but remains constant thereafter. The reason for the initial decrease is that the support size of d_l(x, y, t) increases as u increases [cf., (14)]. When u is large enough, the occluding rectangle W_n(t) no longer overlaps between successive frames, and the size of the support of d_l(x, y, t) in (14) equals the size of W_n for the remaining frames. Thus, further increasing u no longer causes an SNR decrease. Again, for all u in Fig. 4 the SNR is high enough for satisfactory performance of harmonic retrieval algorithms.

Case 3—Completely Visible/Invisible Entity: In this case, we consider the physical entity being completely visible in some frames of the sequence but completely invisible in the remaining frames of the sequence.

Let us assume that the physical entity is completely invisible in a certain number of the T frames. We wish to determine how large


Fig. 4. The SNR as a function of the velocity of the occluding surface (Case 2). The SNR drops initially as u increases but remains constant thereafter.

Fig. 5. The SNR as a function of the number of frames in which the physical entity is completely occluded. The SNR decreases as that number increases. As long as O_l is visible in more than half (five out of ten) of the available frames, the SNR is satisfactory.

a number of missing frames we can afford. Specifically, we assume W_l(t) to be a rectangle (completely visible) in some frames and empty (completely invisible) in the remaining frames (missing frames). The SNR in (14) is evaluated numerically for T = 10 frames. As can be seen in Fig. 5, the SNR decreases as the number of missing frames increases. As long as O_l is visible in more than half of the available frames, the SNR is high enough for satisfactory performance of harmonic retrieval techniques.

In summary, we have shown that using the model in (6), estimating velocities of multiple motion is equivalent to estimating frequencies of multicomponent harmonics. The occlusion can be modeled as a noise term, and in usual circumstances the noise level is tolerable for harmonic retrieval techniques.³ Note that although we have analyzed only three special cases, many additional real-life scenarios can be thought of as combinations of the above cases. The guidelines that we have developed may still be applicable with some modifications. In real situations, however, these guidelines have

³ Note that since any harmonic retrieval method will depend on the data length in terms of resolution and variance of the estimates, for a given SNR the performance of our framework will be influenced by the length of the image sequence T in a similar manner (see, e.g., [24]).

to be applied conservatively, because we have not considered other sources of modeling errors, such as perspective effects and sensor noise, that also affect the SNR in (14). Furthermore, these three cases were chosen to quantitatively demonstrate the effect of occlusion on the SNR. They are by no means the only sorts of occlusion that our formulation can handle.

B. Velocity Estimation Based on Periodogram Analysis

The classical tool for estimating the parameters of harmonics in noise is the periodogram (see, e.g., [24, ch. 4]). When the sample size of the data is large enough, periodogram analysis yields frequency estimates approaching the maximum likelihood ones. Another advantage of estimating frequencies by picking peaks in the periodogram is that it is computationally efficient due to the fast Fourier transform (FFT). In our experiments, we illustrate that periodogram analysis is effective for the estimation of multiple motion. In the following, we first list the basic formulae for periodogram analysis and then make modifications to adapt it for multiple motion estimation.

For fixed (ω_x, ω_y), the periodogram of our time series F(ω_x, ω_y; t) is given by

I_F(ω_x, ω_y, ω) = (1/T) |Σ_{t=0}^{T−1} F(ω_x, ω_y; t) e^{−jωt}|². (15)

Recalling (10), for each different (ω_x, ω_y), I_F(ω_x, ω_y, ω) peaks at

ω = −(ω_x u_l + ω_y v_l), l = 1, ..., N. (16)

In (ω_x, ω_y, ω) space, (16) represents the parametric equations for N 3-D planes, on which I_F(ω_x, ω_y, ω) has maximal values. This is reminiscent of the traditional spatio-temporal approaches in which, however, only one plane is present due to a single motion (see, e.g., [14]). Fig. 6 depicts a case where three motion regions and thus three motion planes in (ω_x, ω_y, ω) space are present. In geometric terms, estimating motion amounts to estimating the orientations of these motion planes. However, traditional spatio-temporal approaches that deal with a single plane are not applicable, since the multiplicity of planes clearly violates the single-plane assumption.

Signal F(ω_x, 0; t), for a chosen value of ω_x, may be viewed as a time series that is a sum of harmonics in noise, the l-th harmonic having frequency

ω_l = −ω_x u_l, (17)

which is a special case of (10). The periodogram I_F(ω_x, 0, ω) reveals the information about these frequencies. Specifically, the number N_u of different component velocities u is obtained by

N_u = # of “dominant” peaks of I_F(ω_x, 0, ω). (18)

Denote the component velocity set by U = {u_(1), ..., u_(N_u)}, where N_u is the number of distinct component velocities. Let (ω_(1), ..., ω_(N_u)) be the N_u-tuple of corresponding peak frequencies, and let the number of zeros padded when computing the FFT be assumed to be


Fig. 6. Geometric interpretation of the periodogram analysis. (a) In (ω_x, ω_y, ω) space, I_F(ω_x, ω_y, ω) peaks on N = 3 planes defined by (16). (b) The intersection of the motion planes and the ω_y = 0 plane. (c) The intersection of the motion planes and the ω_x = 0 plane. (d) The intersection of the motion planes and the ω_x = ω_y plane. In (b) and (c), the three lines correspond to the three motion planes in (a). In (d), however, two lines overlap and can only be estimated as one line.

large enough. The set $\mathcal U$ is estimated via [cf. (17)]

$$\hat{\mathcal U} = \arg\max_{\mathbf u}\ \sum_{i=1}^{\hat N_u} I_F(\omega_x, 0, -u_i\,\omega_x) \qquad (19)$$

subject to

$$u_1 < u_2 < \cdots < u_{\hat N_u}. \qquad (20)$$

Essentially, (19) and (20) pick local maxima from the periodogram $I_F(\omega_x, 0, \omega)$, because if and only if the chosen frequencies are the local maxima will the sum in (19) be maximized. The geometric interpretation of (17)–(20) is that, in $(\omega_x, \omega_y, \omega)$ space, by setting $\omega_y = 0$ we are in fact looking at the intersection of the motion planes defined by (16) and the plane $\omega_y = 0$. In this subspace (the $\omega_x$–$\omega$ plane), the lines of intersection with the motion planes are illustrated in Fig. 6(b).

Similarly, in the subspace $\omega_x = 0$ [see Fig. 6(c)], we have

$$\hat N_v = \#\text{ of ``dominant'' peaks of } I_F(0, \omega_y, \omega) \qquad (21)$$

and

$$\hat{\mathcal V} = \arg\max_{\mathbf v}\ \sum_{i=1}^{\hat N_v} I_F(0, \omega_y, -v_i\,\omega_y) \qquad (22)$$

subject to

$$v_1 < v_2 < \cdots < v_{\hat N_v} \qquad (23)$$

where $\hat{\mathcal V}$ is the estimate of the component velocity set $\mathcal V = \{v_1, \dots, v_{N_v}\}$, and $\mathbf v$ is an $N_v$-tuple.

Now consider the subspace $\omega_x = \omega_y$ [see Fig. 6(d)]. According to (10), the frequencies of the harmonics are $-(u_i + v_i)\omega_x$. Similar to (18)–(22), we have

$$\hat N_s = \#\text{ of ``dominant'' peaks of } I_F(\omega_x, \omega_x, \omega) \qquad (24)$$

and

$$\hat{\mathcal S} = \arg\max_{\mathbf s}\ \sum_{i=1}^{\hat N_s} I_F(\omega_x, \omega_x, -s_i\,\omega_x) \qquad (25)$$

subject to

$$s_1 < s_2 < \cdots < s_{\hat N_s} \qquad (26)$$

where $\hat{\mathcal S}$ is the estimate of the sum component velocity set $\mathcal S = \{s_i = u_i + v_i\}$, and $\mathbf s$ is an $N_s$-tuple $(s_1, \dots, s_{N_s})$.


CHEN et al.: DISCONTINUOUS MOTION ESTIMATION 1249

Note that even when $(u_i, v_i) \neq (u_j, v_j)$, $i \neq j$, it is still possible to have $u_i = u_j$ for some $i, j$, or $v_i = v_j$ for some $i, j$, or $u_i + v_i = u_j + v_j$ for some $i, j$. Geometrically speaking, although the motion planes in $(\omega_x, \omega_y, \omega)$ space are distinct, the intersection lines in each subspace may not be distinct. For instance, one can compare Fig. 6(a)–(c), where three distinct planes and lines are present, with Fig. 6(d), where only two lines are present for the $\omega_x = \omega_y$ subspace. Thus, we always have $\hat N_u \le N$, $\hat N_v \le N$, and $\hat N_s \le N$. It is therefore natural to define the estimate of $N$ to be

$$\hat N = \max\{\hat N_u, \hat N_v, \hat N_s\}. \qquad (27)$$

The estimates for the component velocity sets $\mathcal U$, $\mathcal V$, and $\mathcal S$ are not the same as the estimates for the vectors $(u_i, v_i)$. We have to find the correct pairings of components $u_i$ and $v_i$ in order to find the estimates for $(u_i, v_i)$. Since the total number of velocity vectors $N$ is usually not very large, a simple exhaustive search approach suffices. Due to (27), some of the component velocity set estimates $\hat{\mathcal U}$, $\hat{\mathcal V}$, and $\hat{\mathcal S}$ may have fewer than $\hat N$ elements. First, we make every component velocity set have $\hat N$ elements by systematically repeating some of its existing elements. Then, we try all possible pairings of these augmented component velocity sets and pick the one with the minimal mean squared error as the correct pairing. Formally speaking, if we let $|\cdot|$ denote the cardinality of a set, then $\pi(\mathcal A)$, denoting an extended permutation of the set $\mathcal A$, is defined as follows:

1) if $|\mathcal A| = \hat N$, $\pi(\mathcal A)$ is the same as a regular permutation of $\mathcal A$;
2) if $|\mathcal A| < \hat N$, $\pi(\mathcal A)$ is a regular permutation of an augmented set $\mathcal A^+ \supseteq \mathcal A$, $|\mathcal A^+| = \hat N$, formed by repeating elements of $\mathcal A$.

The correct pairing of the $\hat u_i$ and $\hat v_i$ is then obtained by

$$\min_{\pi(\hat{\mathcal U}),\,\pi(\hat{\mathcal V}),\,\pi(\hat{\mathcal S})}\ \sum_{i=1}^{\hat N}\bigl(\hat u_i + \hat v_i - \hat s_i\bigr)^2 \qquad (28)$$

where $\hat u_i$, $\hat v_i$, and $\hat s_i$ denote the $i$th elements of $\pi(\hat{\mathcal U})$, $\pi(\hat{\mathcal V})$, and $\pi(\hat{\mathcal S})$, respectively.
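The exhaustive pairing can be sketched as follows. The set values below are hypothetical, and, holding the order of one set fixed, only the other two sets need to be permuted; the pairing minimizing the squared-error sum of (28) is kept.

```python
from itertools import permutations

# Sketch of the exhaustive pairing step; the numeric sets are hypothetical.
# For N = 3, 6^3 = 216 full evaluations reduce to 6*6 = 36 once the order
# of one set (here U_hat) is held fixed.
U_hat = [0.0, 1.0, 5.0]
V_hat = [2.0, -1.0, 0.0]
S_hat = [0.0, 5.0, 2.0]    # estimated sums u_i + v_i, in scrambled order

best = None
for pv in permutations(V_hat):
    for ps in permutations(S_hat):
        e = sum((u + v - s) ** 2 for u, v, s in zip(U_hat, pv, ps))
        if best is None or e < best[0]:
            best = (e, list(zip(U_hat, pv)))

err, pairs = best          # minimal squared error and the matched (u, v) pairs
```

Because the sum set constrains which $u$ may be matched with which $v$, the minimal-error pairing recovers the velocity vectors whenever the sums disambiguate the components.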

The above estimation procedure holds for each and every distinct pair $(\omega_x, \omega_y)$. In the noise-free case, choosing one such pair will be sufficient for velocity estimation. However, due to noise effects, averaging the estimates over many $(\omega_x, \omega_y)$ may be desirable for reducing the estimates' variance. Averaging of the estimates obtained at different $(\omega_x, \omega_y)$ pairs necessitates the correct matching of the elements of each set of estimates, which means that, for a particular $i$, the $i$th estimate should correspond to the same true value independent of $\omega_x$ and $\omega_y$. If the averaging is performed over inconsistent elements that correspond to different true values, the estimates will deteriorate rather than improve. Thus, estimating at different $(\omega_x, \omega_y)$ and then averaging is not feasible unless we can establish reliable element-wise correspondence. Alternatively, one may avoid the element-wise correspondence problem by developing averaging schemes such as the one we describe next.

Recall that when $\omega_x = 0$, for fixed $\omega_y$, $I_F(0, \omega_y, \omega)$ peaks at $\omega = -v_i\,\omega_y$. The scaled periodogram defined by

$$\bar I_F(\omega_y, s) = I_F(0, \omega_y, s\,\omega_y) \qquad (29)$$

will thus peak at $s = -v_i$. In other words, if we work with the scaled periodogram, the peak locations remain the same for all $\omega_y$. Now, the task of averaging over different $\omega_y$ values can be performed by averaging the scaled periodogram prior to peak picking. Define the average scaled periodogram as

$$\bar I_F(s) = \frac{1}{|\Omega_y|} \sum_{\omega_y \in \Omega_y} \bar I_F(\omega_y, s). \qquad (30)$$

The scaled periodogram can be easily obtained by adjusting the amount of zero padding (depending on $\omega_y$) prior to computing the FFT. The average scaled periodograms w.r.t. $\omega_x$ (for $\omega_y = 0$) and w.r.t. the $\omega_x = \omega_y$ subspace can be similarly defined as in (30). Substituting (30) into (18)–(25) may lead to velocity estimators with reduced variance.
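The scaled-periodogram averaging can be sketched numerically. The fragment below is our own illustration with assumed values; for each $\omega_y$ the harmonics of $F(0, \omega_y, t)$ sit at $\omega = -v_i\,\omega_y$, so sampling the periodogram at $\omega = s\,\omega_y$ places the peaks at $s = -v_i$ for every $\omega_y$, and the scaled periodograms can then be averaged before peak picking.

```python
import numpy as np

# Illustrative sketch of (29)-(30); notation and values are ours.
T = 64
t = np.arange(T)
v_true = np.array([0.0, 2.0, -1.0])       # component velocities
s_grid = np.arange(-4.0, 4.0, 0.01)       # candidate values of s

avg = np.zeros_like(s_grid)
omega_ys = (0.3, 0.5, 0.7)
for wy in omega_ys:
    F = sum(np.exp(-1j * v * wy * t) for v in v_true)   # harmonics at -v*wy
    E = np.exp(-1j * np.outer(s_grid * wy, t))          # DTFT evaluated at s*wy
    avg += np.abs(E @ F) ** 2 / T                       # scaled periodogram
avg /= len(omega_ys)

is_peak = (avg > np.roll(avg, 1)) & (avg > np.roll(avg, -1)) \
          & (avg > 0.5 * avg.max())
# the surviving local maxima sit near s = -v_i, i.e., at -2, 0, and 1
```

Since the peak locations are the same for every $\omega_y$, the averaging sharpens the peaks without requiring any element-wise correspondence across $(\omega_x, \omega_y)$ pairs.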

Remark: It is possible to develop an alternative approach to replace the procedure after frequency detection and arrive at $(\hat u_i, \hat v_i)$. For a fixed pair $(\omega_x, \omega_y)$, let

$$\hat N' = \#\text{ of ``dominant'' peaks of } I_F(\omega_x, \omega_y, \omega) \qquad (31)$$

and

$$\hat{\boldsymbol\omega} = \arg\max\ \sum_{i=1}^{\hat N'} I_F(\omega_x, \omega_y, \omega_i) \qquad (32)$$

subject to

$$\omega_1 < \omega_2 < \cdots < \omega_{\hat N'} \qquad (33)$$

where $\hat{\boldsymbol\omega}$ is the $\hat N'$-tuple $(\omega_1, \dots, \omega_{\hat N'})$. From the geometric interpretation, we know that the triplets $(\omega_x, \omega_y, \omega_i)$ are points on the planes defined in (10) (see also Fig. 6). Given enough of these points, a clustering algorithm such as ISODATA (see, e.g., [29]) can be used to estimate the number of planes (motions) and the parameters of the planes, which are

the motion parameters $(u_i, v_i)$. Note that the average periodogram technique is no longer applicable, leaving this approach more likely to be noise sensitive. This is because averaging before detection is preferred to detection before averaging (see, e.g., [9] and references therein). Also, clustering algorithms generally require many samples and are more complicated. However, further theoretical analysis, implementation, and comparison of the alternative approach with the current approach are beyond the scope of this paper and will be investigated in future research.



C. Velocity Estimation Based on Subspace Methods

A basic limitation of frequency estimation based on periodogram analysis is the Rayleigh resolution limit [13, ch. 12], according to which frequencies separated by less than $1/T$ Hz cannot be resolved. Thus, the resolution problem is more severe for small temporal sample sizes (small $T$), which is precisely the case for motion estimation. For example, for small $T$ and moderate $\omega_y$, periodogram analysis cannot tell apart component velocities that are less than 1 pixel/frame apart from each other. In situations when high-precision velocity estimates are desired, we propose to use superresolution frequency estimation techniques based on subspace decomposition, such as the multiple signal classification (MUSIC) algorithm (see, e.g., [13, p. 456]). If we substitute MUSIC for periodogram analysis in Section III-B, superresolution velocity estimation follows. The pairing procedure remains the same, while the average periodogram part is no longer applicable. In our current implementation, we avoid the element-wise correspondence problem by simply not averaging over $(\omega_x, \omega_y)$.
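A minimal MUSIC sketch (our stand-in, not the paper's implementation) illustrates the superresolution claim: two noiseless complex exponentials spaced 0.2 rad/sample apart, below the Rayleigh limit $2\pi/20 \approx 0.31$ of a 20-sample record, are resolved from the noise subspace of a covariance built from overlapping subvectors. Record length, subvector size, and grid are assumptions.

```python
import numpy as np

# Illustrative MUSIC sketch for 1-D harmonic retrieval.
T, m, p = 20, 8, 2                       # record length, subvector size, # harmonics
w_true = np.array([1.0, 1.2])            # closer than the Rayleigh limit 2*pi/T
t = np.arange(T)
x = sum(np.exp(1j * w * t) for w in w_true)

# sample covariance from overlapping length-m "snapshots"
snaps = np.array([x[i:i + m] for i in range(T - m + 1)])
R = snaps.T @ snaps.conj() / len(snaps)  # Hermitian m x m covariance

_, V = np.linalg.eigh(R)                 # eigenvalues in ascending order
En = V[:, :m - p]                        # noise-subspace eigenvectors

grid = np.arange(0.0, np.pi, 0.001)
A = np.exp(1j * np.outer(np.arange(m), grid))          # steering vectors a(w)
P = 1.0 / (np.abs(En.conj().T @ A) ** 2).sum(axis=0)   # MUSIC pseudospectrum

is_peak = (P > np.roll(P, 1)) & (P > np.roll(P, -1))
cand, vals = grid[is_peak], P[is_peak]
est = np.sort(cand[np.argsort(vals)[-p:]])             # p strongest peaks
```

The pseudospectrum has sharp maxima wherever the steering vector falls in the signal subspace, so the two frequencies separate cleanly even though a periodogram of the same record could not resolve them.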

D. Dominant Component Problem

It often happens that a dominantly large part of the image moves coherently. For example, a stationary background may lead to a dominant harmonic in (9). The dominant harmonic forms a very strong peak in the periodogram and the MUSIC spectrum, which makes the detection of weaker peaks difficult, especially at low SNR. The usual solution to this problem in harmonic retrieval is step-by-step peak detection; i.e., after the detection of the dominant peak, the detected harmonic is removed, and with a smaller dynamic range one continues with the next dominant peak (see, e.g., [27]). For example, in the $\omega_x = 0$ subspace, we use the following procedure.

Step 1) Set $i = 0$.
Step 2) Find the strongest peak location in the average scaled periodogram of (30).
Step 3) Use a notch filter to remove the harmonic estimated in Step 2.
Step 4) Let $i = i + 1$.
Step 5) Repeat from Step 2 until no more peaks appear (for the threshold selection used to decide the presence (absence) of peaks, see [27]).
Step 6) Let $\hat N_v = i$.
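The step-by-step idea can be sketched as follows. In this illustration of ours, the detected harmonic is subtracted via a least-squares amplitude fit rather than notch filtered, and the stopping threshold is an assumption; a weak harmonic sitting next to a 20-times-stronger one is recovered once the dominant component has been removed.

```python
import numpy as np

# Illustrative detect-and-remove loop (our stand-in for the notch-filter step).
T, N_FFT = 64, 8192
t = np.arange(T)
x = 20.0 * np.exp(1j * 0.5 * t) + 1.0 * np.exp(1j * 2.0 * t)

found = []
resid = x.copy()
for _ in range(5):                               # at most a few components
    I = np.abs(np.fft.fft(resid, N_FFT)) ** 2 / T
    k = int(np.argmax(I))
    if I[k] < 0.05 * np.abs(x).max() ** 2:       # crude stopping threshold
        break
    w = 2 * np.pi * k / N_FFT                    # strongest remaining peak
    a = (resid * np.exp(-1j * w * t)).sum() / T  # least-squares amplitude at w
    resid = resid - a * np.exp(1j * w * t)       # remove the detected harmonic
    found.append(w)
```

After the dominant harmonic is subtracted, the residual's dynamic range shrinks and the weak component becomes the strongest remaining peak, exactly the effect the step-by-step procedure relies on.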

E. Spatial Assignment of Motion Estimates

In some applications, e.g., synthetic aperture radar, the goal is merely to measure the velocities. However, obtaining the velocity estimates $(\hat u_i, \hat v_i)$, as we have described in previous sections, is only a part of our motion estimation problem, because we have to assign the $(\hat u_i, \hat v_i)$'s to the correct spatial locations in the image; i.e., we have to find the region associated with each $(\hat u_i, \hat v_i)$. It is interesting to notice that this step, namely explicit spatial assignment of velocity estimates, is unique to this new framework for multiple motion estimation. In existing motion estimation methods, the assignment is always concurrent with velocity estimation and thus implicit.

We define the one-step prediction error as

$$e_i(x, y, t) = f(x, y, t) - f(x - \hat u_i,\, y - \hat v_i,\, t - 1). \qquad (34)$$

Within the region that actually moves with $(\hat u_i, \hat v_i)$, the prediction error should be zero under ideal conditions. If we assign a label to each pixel $(x, y)$ according to

$$l(x, y) = \arg\min_i \sum_t |e_i(x, y, t)|$$

we can find the estimate of the region $R_i$ as

$$\hat R_i = \{(x, y)\,:\, l(x, y) = i\}. \qquad (35)$$

Prior to the labeling process (35), one may use lowpass filtering to combat the noise in the prediction error. We realize that erroneous spatial assignment is mainly caused by the noise in the prediction error, which is largely owing to the implementation of (34). To calculate (34), interpolation has to be employed when the motion vectors do not have integer values. The use of interpolation, however, is common to many existing techniques, especially the MRF-based approaches, e.g., [20].
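The labeling idea can be sketched for integer candidate velocities, where no interpolation is needed. The fragment below is a synthetic illustration of ours: the left half of a frame is static, the right half shifts one pixel to the right, and each pixel receives the label of the candidate velocity minimizing its prediction error.

```python
import numpy as np

# Illustrative sketch of the assignment behind (34)-(35); data and names ours.
rng = np.random.default_rng(0)
f0 = rng.random((32, 32))
f1 = f0.copy()
f1[:, 16:] = f0[:, 15:31]                # right half moved by (u, v) = (1, 0)

candidates = [(0, 0), (1, 0)]            # previously estimated velocities
errors = [np.abs(f1 - np.roll(np.roll(f0, v, axis=0), u, axis=1))
          for u, v in candidates]        # one-step prediction error per candidate
labels = np.argmin(np.stack(errors), axis=0)
# static region -> label 0, moved region -> label 1 (away from the wrap column)
```

On real data the per-pixel error maps are noisy, which is why a lowpass filter is applied before taking the per-pixel minimum.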

F. Contrast Enhancement

In the following, we describe a preprocessing step that may prove beneficial to the motion estimation procedures discussed so far. When considering the SNR issue, we have examined the dependency of the SNR on the image content only in the average sense. However, preprocessing the image sequence and enhancing the useful portion of the image content may also increase the SNR. One such situation occurs when the moving entity is small in size and has little contrast with the background. Correspondingly, the harmonic component for this entity will have so low an amplitude that peak detection becomes difficult. Different contrast enhancement techniques can be helpful in this situation. In particular, we consider the following simple histogram modification. Assuming that the histogram is unimodal with values $f \in [f_{\min}, f_{\max}]$ and mode $f_p$, we create a new image according to the mapping:

if

if

if .

(36)

Essentially, the mapping of (36) increases the brightness of those pixels that appear less frequently in the original image. Experimental results have shown that the contrast enhancement achieved by (36) often improves motion estimates. However, we emphasize that this preprocessing step is helpful but not imperative for the successful application of our framework.
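The effect can be sketched with a hypothetical stand-in for (36) (the exact three-case mapping is not reproduced here): gray levels that occur rarely are brightened, while levels near the histogram mode are left nearly unchanged. All names and parameters below are our own.

```python
import numpy as np

# Hypothetical rarity-based brightening in the spirit of (36); not the
# paper's exact mapping.
def enhance(img, n_bins=64, gain=0.5):
    hist, edges = np.histogram(img, bins=n_bins)
    idx = np.clip(np.digitize(img, edges) - 1, 0, n_bins - 1)
    rarity = 1.0 - hist[idx] / hist.max()     # 0 at the mode, near 1 in the tails
    return np.clip(img * (1.0 + gain * rarity), 0.0, 1.0)

# a small, low-contrast "moving entity" on a dominant mid-gray background
rng = np.random.default_rng(1)
img = np.full((64, 64), 0.5) + 0.01 * rng.standard_normal((64, 64))
img[28:34, 28:34] += 0.1                      # the weak entity
out = enhance(img)                            # entity/background contrast grows
```

Boosting the rare gray levels raises the amplitude of the weak harmonic contributed by the small entity, which is what makes its periodogram peak easier to detect.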



Fig. 7. Synthetic motion sequence (ten frames). The images are arranged in row-first format. The background is stationary ($u_0 = 0$, $v_0 = 0$), while the block in the upper-left corner moves downwards ($u_1 = 0$, $v_1 = 2$) and the other block in the lower-right corner moves in the northeast direction ($u_2 = 1$, $v_2 = -1$).

(a) (b)

Fig. 8. Estimated motion field of the synthetic sequence in Fig. 7: (a) motion field at $t = 0$ and (b) motion field at $t = 8$.

G. Performance Issues

In this section, we discuss two additional issues on theperformance of the algorithm, namely the identifiability of themotion vectors and the complexity of the algorithm.

In some cases, it is impossible to estimate motion from theimage data even under ideal conditions, e.g., when there isnot enough texture in the image. It is of theoretical interestto identify a broader set of such cases. From a systemidentification point of view, we need to establish conditionsfor motion identifiability.

Suppose we have established the number of motion vectors, $N$, and the correspondence between component motions to arrive at the pairs $(\hat u_i, \hat v_i)$. Further, suppose the associated image region is known. Let us consider $(u_i, v_i)$ for a particular $i$. In order to obtain a unique estimate of $(u_i, v_i)$, we need two pairs $(\omega_x^{(1)}, \omega_y^{(1)})$ and $(\omega_x^{(2)}, \omega_y^{(2)})$ such that

$$u_i\,\omega_x^{(k)} + v_i\,\omega_y^{(k)} = -\omega^{(k)}, \qquad k = 1, 2 \qquad (37)$$

has a unique solution, as follows.

TABLE I: TRUE AND ESTIMATED MOTION PARAMETERS

(IC1) $(\omega_x^{(1)}, \omega_y^{(1)})$ and $(\omega_x^{(2)}, \omega_y^{(2)})$ are linearly independent.

Furthermore, we have to ensure that the peak frequencies at $(\omega_x^{(1)}, \omega_y^{(1)})$ and $(\omega_x^{(2)}, \omega_y^{(2)})$ are uniquely identifiable. To this end, ignoring the effect of noise on frequency estimation, it suffices to have

(IC2) $F(\omega_x^{(k)}, \omega_y^{(k)}, t) \neq 0$, $k = 1, 2$, and
(IC3) $|u_i\,\omega_x^{(k)} + v_i\,\omega_y^{(k)}| < \pi$, $k = 1, 2$.
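The uniqueness argument behind (37) and (IC1) amounts to solving a 2 × 2 linear system; the following numerical sketch, with illustrative values of ours, shows that two linearly independent frequency pairs pin down $(u, v)$ from the peak frequencies.

```python
import numpy as np

# Numerical sketch of (37)/(IC1); values are illustrative.
u, v = 1.0, -2.0
W = np.array([[0.3, 0.1],    # (omega_x, omega_y) for the first pair
              [0.1, 0.4]])   # a linearly independent second pair
omega = -(W @ np.array([u, v]))       # peak frequencies predicted by (16)
u_hat, v_hat = np.linalg.solve(W, -omega)   # unique since det(W) != 0
```

If the two frequency pairs were collinear, as happens for image content varying along a single orientation, the system would be singular and $(u, v)$ could not be recovered, which is exactly the aperture problem discussed below.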

Note that these conditions should not be understood asrestrictions of this work since they are common to most



Fig. 9. Multicomponent harmonic signal, periodograms, and MUSIC spectrum. (a) The real part of $F(\omega_x, \omega_y, t)$ with $\omega_x = 0$, $\omega_y = 0.4909$. It is the superposition of three harmonic components. (b) Periodogram of the multicomponent harmonic signal in (a). Three peaks correspond to the three components. (c) Periodogram of a three-component harmonic signal with close frequency separation. The three peaks become indistinguishable due to the resolution limit of the periodogram. (d) MUSIC spectrum of the same data as (c). Three peaks are clearly visible.

motion estimation algorithms. Violations of these conditions are manifestations of many known problems in motion estimation. Some illustrations are in order. For example, when the image data have a constant gray level within the finite Fourier transform window, (IC1) and (IC2) will be violated, since we only have spectral content at $\omega_x = 0$ and $\omega_y = 0$. In fact, the problem persists when the image data have a constant gray level along a particular orientation, because in this case the usable $(\omega_x, \omega_y)$ pairs will lie on a straight line and thus be linearly dependent. These are all manifestations of what is commonly known as the aperture problem in motion estimation (see, e.g., [19]).

(IC3) is closely related to temporal aliasing in motion estimation. In general, when (IC1) and (IC2) are satisfied, (IC3) limits the range of motion vectors that are identifiable. Theoretically, the magnitude of $(u_i, v_i)$ can be arbitrarily large as long as the Fourier transform of the spatial image is continuous, in which case one can always select $\omega_x$ and $\omega_y$ to be arbitrarily small so that (IC3) is satisfied. But when the spectrum is discrete, $(u_i, v_i)$ becomes nonidentifiable beyond a certain range. For example, in the case of periodic texture, it is impossible to distinguish motion vectors that differ by multiples of the period of the texture if the range of motion vectors is unknown. A similar approach that looks at the motion identifiability problem from the system identification point of view can be found in [10].

Finally, we note that the proposed algorithm is computationally efficient. The whole process consists of 2-D spatial FFT's of each frame and several small-sized 1-D FFT's, depending on the number of motion vectors and the length of the temporal processing window. The correspondence can be solved at minimal cost even using the exhaustive approach described in Section III-B. For example, when $N = 3$, i.e., there are three motion vectors to estimate, all we need to do is evaluate the right-hand argument of (28) 216 times and sort the results. As for the labeling process, we need to perform motion prediction for each possible motion vector at each pixel. As a result, extra memory buffers are needed to store the intermediate results of motion prediction.

IV. EXPERIMENTAL RESULTS

We have carried out experiments on synthetic image se-quences as well as real image data. The validity of theformulation, the accuracy of the estimates, and comparisonwith an existing method are established through the syntheticdata. We then applied our formulation to the publicly available



Fig. 10. Synthetic motion sequence (ten frames). The images are arranged in row-first format. The background is stationary ($u_0 = 0$, $v_0 = 0$), while the foreground grid moves in the northwest direction ($u_1 = -1$, $v_1 = -1$).

(a) (b)

Fig. 11. Estimated motion field of the synthetic sequence by our formulation: (a) overall motion field at $t = 0$ and (b) detailed version of the upper-left corner (32 × 32) of (a).

TABLE II: TRUE AND ESTIMATED MOTION PARAMETERS

Hamburg taxi data. In all cases, we have treated the whole image as a single processing window to demonstrate the effectiveness of our algorithm for multiple motion. As is the case for most motion estimation algorithms, better results could be obtained if our algorithm were applied adaptively to smaller local blocks, within which the motion vectors are closer to constant, assuming that noise effects do not dominate.

A. Feasibility, Accuracy, and MUSIC for Motion

Fig. 7 shows an artificially generated motion sequence of ten frames using the Brodatz texture data [8]. The images are arranged in row-first format. The background is stationary ($u_0 = 0$, $v_0 = 0$), while the block in the upper-left corner moves downwards ($u_1 = 0$, $v_1 = 2$) and the other block in the lower-right corner moves in the northeast direction ($u_2 = 1$, $v_2 = -1$). Fig. 9(a) depicts the real part of $F(\omega_x, \omega_y, t)$ with $\omega_x = 0$, $\omega_y = 0.4909$. Although it is supposed to be the superposition of three harmonic components—one with zero frequency (the background), one with frequency 0.9818 rad (the first block), and one with frequency 0.4909 rad (the second block)—visually one cannot discern the individual harmonic components. In the periodogram, however, we clearly see three peaks corresponding to the three harmonic components [Fig. 9(b)]. When $\omega_y$ becomes smaller, the difference between the $\omega_i$'s also becomes smaller [cf. (17)], and thus the three peaks become difficult to separate in the periodogram [Fig. 9(c)], but they are still possible to separate in the MUSIC spectrum [Fig. 9(d)].



(a) (b)

Fig. 12. Estimated motion field of the synthetic sequence by Singh's method: (a) overall motion field at $t = 0$ and (b) detailed version of the upper-left corner (32 × 32) of (a).

Fig. 13. Estimated motion field of the synthetic sequence by the phase-based method. This figure shows the normal (optical) flow only. The algorithm did not produce any output for the full flow due to the low confidence of the estimates.

We therefore recognize the feasibility of the periodogram approach and its resolution limit, which motivates the use of MUSIC for motion. Table I shows the true and estimated motion parameters, while Fig. 8 shows the spatial motion vector map. A 3 × 3 finite impulse response (FIR) lowpass filter was used prior to the labeling process of (35). The results on synthetic data indicate that our formulation is indeed capable of estimating multiple motion accurately.

B. “Dense” Motion Discontinuities

One of the features of our formulation is that motion estimation is achieved regardless of the spatial distribution (shape) of the images of the physical entities. Thus, in contrast to existing methods, which usually prefer clustered regions with homogeneous motion and "sparse" motion discontinuities, our algorithm is expected to perform well even when the moving object (occluding surface) is distributed in space and motion discontinuities are rather "dense" and abundant. These situations arise, for example, when one looks through foliage or screen windows.

We chose Singh's method for comparison, since it has been reported to have better performance at motion discontinuities than traditional methods [28, pp. 76–78] and its implementation is made publicly available by Barron et al. [5]. Fig. 10 shows an artificially generated motion sequence of ten frames using the Brodatz texture data [8]. The images are arranged in row-first format. The background is stationary ($u_0 = 0$, $v_0 = 0$), while the rectangular grid in the foreground, simulating a screen window, is moving in the northwest direction ($u_1 = -1$, $v_1 = -1$). The true parameters and those estimated using our method are shown in Table II. Fig. 11(a) shows the overall motion field estimated using our algorithm, and Fig. 11(b) is a detailed version of the 32 × 32 upper-left corner of Fig. 11(a) for better visualization. The motion field estimated using Singh's method is shown in Fig. 12(a) and (b). In Fig. 13, we show the result of using the phase-based algorithm [12] on the same data. The implementation is from the same source [5]. We remind the reader that the condition of this comparison is very unfavorable for the phase-based algorithm, since it does not explicitly model motion discontinuity. These examples suggest that "dense" motion discontinuities have little effect on our formulation, while they may have significant adverse effects on other methods.



(a) (b)

Fig. 14. The first and last frame of the Hamburg taxi sequence: (a) image at $t = 0$ and (b) image at $t = 20$.

TABLE III: MANUALLY MEASURED APPROXIMATE MOTION PARAMETERS AND ESTIMATED MOTION PARAMETERS

C. Results on Real Data

Fig. 14(a) and (b) show the first and the last frames of the Hamburg taxi data [22]. There are five distinct moving entities in this sequence:

1) the stationary background;
2) the turning taxi in the center;
3) a bus in the lower-right corner moving to the upper left;
4) a black car in the lower-left corner moving to the lower right; and
5) a pedestrian in the upper-left corner.

Unfortunately, the groundtruth values of the motion parameters are not available. However, in Table III, we provide "groundtruth" parameters obtained through manual feature-point tracking for comparison with the estimated parameters (see also [5]). Since these manually generated values may not always be reliable, the differences between the estimates and the manually measured values should not be understood as errors. Since we have used the whole image as a single processing window, the pedestrian is not detected due to its relatively tiny size and slow motion. Preprocessing using the histogram modification described in Section III-F was applied. Fig. 15(a) and (b) show the estimated motion field at the beginning and the end of the sequence. A 7 × 7 FIR lowpass filter was used prior to the labeling process of (35).

Results in Table III and Fig. 15(a) and (b) demonstrate thatour algorithm can effectively estimate multiple motion in realimage data.

V. CONCLUSIONS

It is widely assumed that, locally if not globally, eachpiece of the piecewise smooth motion field can be sufficientlydescribed by a single motion vector. Under this assumption,

(a)

(b)

Fig. 15. Estimated motion field for the Hamburg taxi sequence: (a) motion field at $t = 0$ and (b) motion field at $t = 19$. For visualization purposes, the motion fields are subsampled by a factor of five.

we have introduced a new framework to process discontinuous(or multiple) motion. In contrast to most existing techniques,



our velocity estimation algorithm is noniterative; furthermore, velocities are computed regardless of motion discontinuities, and thus the proposed algorithm performs well when the moving object (occluding surface) is distributed in space and motion discontinuities are rather "dense" and abundant.

Our framework broadens considerably the scope of spatio-temporal approaches, which have suffered from a lack of explicit modeling and processing of motion discontinuities. It also enables us to achieve superresolution in velocity estimates, e.g., by using MUSIC for motion, which is fundamentally not reachable by existing spatio-temporal approaches.

As for future research, we may combine this framework withthe time-varying motion estimation techniques developed in[9] and achieve time-varying multiple motion estimation. Thecombined discontinuous (or multiple) and time-varying motionmodels are more realistic models for motion computation invideo sequences. Preliminary results are encouraging and willbe reported elsewhere.

ACKNOWLEDGMENT

The authors thank Prof. G. Zhou at Georgia Tech forthe implementation of the MUSIC algorithm. The authorsappreciate the efforts and the suggestions of the anonymousreviewers.

REFERENCES

[1] E. H. Adelson and J. R. Bergen, "Spatiotemporal energy models for the perception of motion," J. Opt. Soc. Amer. A, vol. 2, pp. 284–299, 1985.
[2] J. K. Aggarwal and N. Nandhakumar, "On the computation of motion from sequences of images—A review," Proc. IEEE, vol. 76, pp. 917–935, Aug. 1988.
[3] T. D. Albright, "Direction and orientation selectivity of neurons in visual area MT of the macaque," J. Neurophysiol., vol. 52, pp. 1106–1130, 1984.
[4] P. Anandan, "A computational framework and an algorithm for the measurement of visual motion," Int. J. Comput. Vis., vol. 6, pp. 283–310, 1989.
[5] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, "Systems and experiment: Performance of optical flow techniques," Int. J. Comput. Vis., vol. 12, pp. 43–77, 1994.
[6] M. J. Black, "Recursive nonlinear estimation of discontinuous flow fields," in Comput. Vis.—ECCV'94, J.-O. Eklundh, Ed. Berlin, Germany: Springer-Verlag, 1994, vol. 1, pp. 138–145.
[7] D. R. Brillinger, Time Series: Data Analysis and Theory. San Francisco, CA: Holden-Day, 1981.
[8] P. Brodatz, Textures: A Photographic Album for Artists and Designers. New York: Dover, 1966.
[9] W.-G. Chen, G. B. Giannakis, and N. Nandhakumar, "Spatio-temporal approach for time-varying image motion estimation," in Proc. IEEE Int. Conf. Image Processing, Austin, TX, Nov. 1994, vol. II, pp. 232–236.
[10] W.-G. Chen, N. Nandhakumar, and W. N. Martin, "Image motion estimation from motion smear—A new computational model," IEEE Trans. Pattern Anal. Machine Intell., vol. 18, pp. 412–425, Apr. 1996.
[11] J. L. Flanagan, "Technologies for multimedia communications," Proc. IEEE, vol. 82, pp. 590–603, Apr. 1994.
[12] D. J. Fleet and A. D. Jepson, "Computation of component image velocity from local phase information," Int. J. Comput. Vis., vol. 5, pp. 77–104, 1990.
[13] S. Haykin, Adaptive Filter Theory. Englewood Cliffs, NJ: Prentice-Hall, 1991.
[14] D. J. Heeger, "Optical flow from spatiotemporal filters," in Proc. 1st Int. Conf. Computer Vision, June 1987, pp. 181–190.
[15] F. Heitz and P. Bouthemy, "Motion estimation and segmentation using a global Bayesian approach," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1990, pp. 2305–2308.
[16] E. Hildreth, "Computations underlying the measurement of visual motion," Artif. Intell., vol. 23, pp. 309–354, 1984.
[17] B. K. P. Horn, Robot Vision. Cambridge, MA: MIT Press, 1986.
[18] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artif. Intell., vol. 17, pp. 185–203, 1981.
[19] ——, "Determining optical flow," Artif. Intell., vol. 24, pp. 185–203, 1981.
[20] J. Konrad and E. Dubois, "Bayesian estimation of motion vector fields," IEEE Trans. Pattern Anal. Machine Intell., vol. 14, pp. 910–927, Sept. 1992.
[21] J. S. Lim, Two-Dimensional Signal and Image Processing. Englewood Cliffs, NJ: Prentice-Hall, 1990.
[22] H.-H. Nagel, "Displacement vectors derived from second-order intensity variations in image sequences," Comput. Vision Graph. Image Process., vol. 21, pp. 85–117, 1983.
[23] H.-H. Nagel and W. Enkelmann, "Dynamic occlusion analysis in optical flow fields," IEEE Trans. Pattern Anal. Machine Intell., vol. 8, pp. 565–593, 1986.
[24] B. Porat, Digital Processing of Random Signals: Theory and Methods. Englewood Cliffs, NJ: Prentice-Hall, 1994.
[25] M. Proesmans, L. V. Gool, E. Pauwels, and A. Oosterlinck, "Determination of optical flow and its discontinuities using nonlinear diffusion," in Comput. Vis.—ECCV'94, J.-O. Eklundh, Ed. Berlin, Germany: Springer-Verlag, 1994, vol. 2, pp. 295–304.
[26] I. Reed, R. Gagliardi, and L. Stotts, "Optical moving target detection with 3-D matched filtering," IEEE Trans. Aerosp. Electron. Syst., vol. 24, pp. 327–335, 1988.
[27] R. H. Shumway, "Replicated time-series regression: An approach to signal detection and estimation," in Handbook of Statistics, D. R. Brillinger and P. R. Krishnaiah, Eds. Amsterdam, The Netherlands: Elsevier, 1983, vol. 3, pp. 383–408.
[28] A. Singh, Optic Flow Computation: A Unified Perspective. Los Alamitos, CA: IEEE Comput. Soc. Press, 1991.
[29] C. W. Therrien, Decision, Estimation and Classification: An Introduction to Pattern Recognition and Related Topics. New York: Wiley, 1989.
[30] T. Y. Tian and M. Shah, "Motion segmentation and estimation," in Proc. IEEE Int. Conf. Image Processing, Austin, TX, Nov. 1994, vol. II, pp. 785–789.
[31] J. Y. A. Wang and E. H. Adelson, "Representing moving images with layers," IEEE Trans. Image Processing, vol. 3, pp. 625–638, 1994.
[32] J. Zhang and J. Hanauer, "The application of mean field theory to image motion estimation," IEEE Trans. Image Processing, vol. 4, pp. 19–32, 1995.
[33] H. Zheng and S. D. Blostein, "An error-weighted regularization algorithm for image motion-field estimation," IEEE Trans. Image Processing, vol. 2, pp. 246–252, 1993.

Wei-Ge Chen (M'95) received the B.S. degree from Beijing University, Beijing, China, in 1989, and the M.S. degree in biophysics and the Ph.D. degree in electrical engineering, both from the University of Virginia, Charlottesville, in 1992 and 1995, respectively.

Since 1995, he has been with the Microsoft Corporation, Redmond, WA, working on the development of advanced video compression technology. He has been an active participant in the Moving Pictures Expert Group. His research interests include image/video processing, analysis, and compression.



Georgios B. Giannakis (S'84–M'86–SM'91–F'96) received the Diploma in electrical engineering from the National Technical University of Athens, Greece, in 1981, and the M.Sc. degree in electrical engineering, the M.Sc. degree in mathematics, and the Ph.D. degree in electrical engineering, all from the University of Southern California (USC), Los Angeles, the latter two in 1986.

After lecturing for one year at USC, he joined the University of Virginia, Charlottesville, in September 1987, where he is now a Professor with the Department of Electrical Engineering. His general interests lie in the areas of signal processing, estimation and detection theory, and system identification. Specific research areas of current interest include diversity techniques for channel estimation and multiuser communications, nonstationary and cyclostationary signal analysis, wavelets in statistical signal processing, and non-Gaussian signal processing with applications to SAR, array, and image processing.

Dr. Giannakis received the IEEE Signal Processing Society's 1992 Paper Award in the Statistical Signal and Array Processing (SSAP) area. He co-organized the 1993 IEEE Signal Processing Workshop on Higher-Order Statistics, the 1996 IEEE Workshop on Statistical Signal and Array Processing, and the first IEEE Signal Processing Workshop on Wireless Communications in 1997. He was Guest Co-Editor of two special issues on higher-order statistics of the International Journal of Adaptive Control and Signal Processing and the EURASIP journal Signal Processing. He was also Guest Co-Editor of a special issue on signal processing for advanced communications of the IEEE TRANSACTIONS ON SIGNAL PROCESSING (January 1997). He has served as an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING and the IEEE SIGNAL PROCESSING LETTERS, a secretary of the Signal Processing Conference Board, and a member of the IEEE SP Publications Board and the SSAP Technical Committee. He is also a member of the IMS and the European Association for Signal Processing.

N. Nandhakumar (S'78–M'86–SM'91) received the M.S. degree in computer, information, and control engineering from the University of Michigan, Ann Arbor, and the Ph.D. degree in electrical engineering from the University of Texas at Austin.

Currently, he is Manager of the Video and Image Processing Group at LG Electronics Research Center, Princeton, NJ, where he is pursuing research in the areas of video indexing and retrieval, video compression, motion analysis, and areas related to image and video processing for networked multimedia applications. Previously, he led the development of machine vision technology for emerging wafer inspection markets. He has also taught graduate courses and directed sponsored research in the areas of computer vision, image processing, and pattern recognition. His research and development activity has dealt with the estimation of motion from image sequences, autonomous navigation for mobile robots, 3-D object reconstruction, integration of multisensory data, and development of machine vision systems for industrial automation. He established the Machine Vision Laboratory at the University of Virginia, Charlottesville, where he holds a visiting faculty position. His research has been supported by federal, state, and industrial sources. Results of his research have been published in more than 80 journal papers, conference proceedings, and book chapters. Recently, his graduate students received awards at Zenith Data Systems' Masters of Innovation Contest and at B. F. Goodrich's Collegiate Inventors Program. He has also participated in the organization of several international conferences on computer vision, pattern recognition, and image analysis. He is Associate Editor of the Journal of Pattern Recognition.

Dr. Nandhakumar is a member of the International Society for Optical Engineering and a member of the IEEE Computer Society Pattern Analysis and Machine Intelligence Technical Committee.