
A Unified Algebraic Approach to 2-D and 3-D Motion Segmentation and Estimation∗

René Vidal
Center for Imaging Science, Department of Biomedical Engineering, Johns Hopkins University
308B Clark Hall, 3400 N. Charles St., Baltimore, MD 21218
Tel: (410)-516-7306  E-mail: [email protected]

Yi Ma
Department of ECE, University of Illinois at Urbana-Champaign
1406 West Green Street, Urbana, IL 61801
Tel: (217)-244-0871  E-mail: [email protected]

September 20, 2004

Abstract. In this paper, we present an analytic solution to the problem of estimating an unknown number of 2-D and 3-D motion models from two-view point correspondences or optical flow. The key to our approach is to view the estimation of multiple motion models as the estimation of a single multibody motion model. This is possible thanks to two important algebraic facts. First, we show that all the image measurements, regardless of their associated motion model, can be fit with a single real or complex polynomial. Second, we show that the parameters of each individual motion model can be obtained from the derivatives of the polynomial at any measurement associated with that motion. This leads to a novel motion segmentation and estimation algorithm that applies to most of the two-view motion models adopted in computer vision. Our experiments show that the proposed algorithm outperforms existing algebraic and factorization-based methods in terms of efficiency and robustness, and provides a good initialization for iterative techniques, such as Expectation Maximization, whose performance strongly depends on good initialization.

Keywords: Multibody structure from motion, motion segmentation, multibody epipolar constraint, multibody fundamental matrix, multibody homography, and Generalized PCA (GPCA).

1. Introduction

A classic problem in visual motion analysis is to estimate a motion model for a set of 2-D feature points as they move in a video sequence. When the scene is static, i.e. when either the camera or a single object moves, the problem of fitting a 3-D model compatible with the structure and motion of the scene is well understood [13, 24]. For instance, it is well known that two views of a scene are related by the so-called epipolar constraint [22] and that multiple views are related by the so-called multilinear constraints [14]. These constraints can be used to estimate a single motion model using linear techniques

∗ This paper is an extended version of [36]. The authors thank Sampreet Niyogi for his help with the experimental section of the paper. Work supported by Hopkins WSE and UIUC ECE research funds.

© 2004 Kluwer Academic Publishers. Printed in the Netherlands.

ijcv05-motion.tex; 20/09/2004; 3:12; p.1


such as the eight-point algorithm and its variations. In the presence of noise, many optimal nonlinear algorithms for single-body motion estimation have been proposed. For example, see [23, 31, 37, 46] for the discrete case and [28] for the differential case.

In this paper we address the more general case of dynamic scenes, in which both the camera and an unknown number of objects with unknown 3-D structure can move independently. Thus, different regions of the image may obey different 2-D or 3-D motion models due to depth discontinuities, perspective effects, multiple motions, etc. More specifically, we consider the following problem:

Problem 1 (Multiple-Motion Segmentation and Estimation)

Given a set of image measurements $\{(x_1^j, x_2^j)\}_{j=1}^N$ taken from two views of a motion sequence related by a collection of $n$ motion models $\{M_i\}_{i=1}^n$, estimate the number of motion models and their parameters without knowing which image measurements correspond to which motion model.

The above problem can be divided into two main categories: first, 2-D motion segmentation, which refers to the estimation of the 2-D motion field in the image plane, and second, 3-D motion segmentation, which refers to the estimation of the 3-D motion (rotation and translation) of multiple rigidly moving objects relative to the camera.

When the scene is static, one can model the 2-D motion of the scene as a mixture of 2-D motion models such as translational, affine, or projective. Even though a single 3-D motion is present, multiple 2-D motion models arise because of perspective effects, depth discontinuities, occlusions, transparent motions, etc. In this case, the task of 2-D motion segmentation is to estimate these models from the image data. When the scene is dynamic, one can still model the scene with a mixture of 2-D motion models. Some of these models are due to independent 3-D motions, e.g., when the motion of an object relative to the camera can be well approximated by the affine motion model. Others are due to perspective effects and/or depth discontinuities, e.g., when some of the 3-D motions are broken into different 2-D motions. The task of 3-D motion segmentation is to obtain a collection of 3-D motion models, in spite of perspective effects and/or depth discontinuities.

1.1. RELATED LITERATURE

Classical approaches to 2-D motion segmentation separate the image flow into different regions by looking for flow discontinuities [29, 3], fit a mixture of parametric models through successive computation of dominant motions [15], or use a layered representation of the motion field [6]. The problem has also been formalized in a maximum likelihood framework [16, 2, 44, 45, 32]


in which the estimation of the motion models and their regions of support is done by alternating between the segmentation of the image measurements and the estimation of the motion parameters using the Expectation Maximization (EM) algorithm. EM-like approaches provide robust motion estimates by combining information over large regions in the image. However, their convergence to the optimal solution strongly depends on correct initialization [27, 32]. Existing initialization techniques estimate a 2-D motion representation from local patches and cluster this representation using normalized cuts [27] or K-means [43]. The drawback of these approaches is that they are based on a local computation of 2-D motion, which is subject to the aperture problem and to the estimation of a single model across motion boundaries. Some of these problems can be partially solved by incorporating multiple frames and a local process that forces the clusters to be connected [21]. The only existing algebraic solution to 2-D motion segmentation is based on bi-homogeneous polynomial factorization and can be found in [41].

The 3-D motion segmentation problem has received comparatively less attention. Existing work [7] solves this problem by first clustering the features corresponding to the same motion using, e.g., K-means or spectral clustering, and then estimating a single motion model for each group. This can also be done in a probabilistic framework [33] by alternating between feature clustering and single-body motion estimation using the EM algorithm. In order to deal with the initialization problem of EM-like approaches, recent work has concentrated on the study of the geometry of dynamic scenes, including the analysis of multiple points moving linearly with constant speed [9, 26] or in a conic section [1], multiple points moving in a plane [30], multiple translating planes [47], self-calibration from multiple motions [8, 10], multiple moving objects seen by an affine camera [4, 17, 49, 19, 50, 20, 35], and two-object segmentation from two perspective views [48]. The case of multiple moving objects seen by two perspective views was recently studied in [40, 42], which proposed a generalization of the 8-point algorithm based on the so-called multibody epipolar constraint and its associated multibody fundamental matrix. The method simultaneously recovers multiple fundamental matrices using multivariate polynomial factorization, and can be extended to three perspective views via the so-called multibody trifocal tensor [12]. To the best of our knowledge, [25] is the only existing work on 3-D motion segmentation from omnidirectional cameras.

1.2. CONTRIBUTIONS OF THIS PAPER

In this paper, we address the initialization of iterative approaches to motion estimation and segmentation by proposing a non-iterative algebraic solution to Problem 1 that applies to most 2-D and 3-D motion models in computer vision, as detailed in Table I.


Table I. 2-D and 3-D Motion Models Considered in This Paper.

Motion model      | Model equation                  | Model parameters                                      | Segmentation of
2-D translational | $x_2 = x_1 + T_i$               | $\{T_i \in \mathbb{R}^2\}_{i=1}^n$                    | Hyperplanes in $\mathbb{C}^2$
2-D similarity    | $x_2 = \lambda_i R_i x_1 + T_i$ | $\{R_i \in SO(2),\ \lambda_i \in \mathbb{R}\}_{i=1}^n$ | Hyperplanes in $\mathbb{C}^3$
2-D affine        | $x_2 = A_i [x_1^T\ 1]^T$        | $\{A_i \in \mathbb{R}^{2\times 3}\}_{i=1}^n$          | Hyperplanes in $\mathbb{C}^4$
3-D translational | $0 = x_2^T [T_i]_\times x_1$    | $\{T_i \in \mathbb{R}^3\}_{i=1}^n$                    | Hyperplanes in $\mathbb{R}^3$
3-D rigid-body    | $0 = x_2^T F_i x_1$             | $\{F_i \in \mathbb{R}^{3\times 3}\}_{i=1}^n$          | Bilinear forms in $\mathbb{R}^{3\times 3}$
3-D homography    | $x_2 \sim H_i x_1$              | $\{H_i \in \mathbb{R}^{3\times 3}\}_{i=1}^n$          | Bilinear forms in $\mathbb{C}^{2\times 3}$

The key to our approach is to view the estimation of multiple motion models as the estimation of a single, though more complex, multibody motion model that is then factored into the original models. This is achieved by (1) algebraically eliminating the feature segmentation problem, (2) fitting a single multibody motion model to all the image measurements, and (3) segmenting the multibody motion model into its individual components. More specifically, our approach proceeds as follows:

1. Eliminate Feature Segmentation: Find an algebraic equation that is satisfied by all the image measurements, regardless of the motion model associated with each measurement. For the motion models considered in this paper, the $i$th motion model will typically be defined by an algebraic equation of the form $f(x_1, x_2, M_i) = 0$, for $i = 1, \ldots, n$. Therefore, an algebraic equation that is satisfied by all the data is

$$g(x_1, x_2, M) = f(x_1, x_2, M_1) \cdots f(x_1, x_2, M_n) = 0. \quad (1)$$

Such an equation represents a single multibody motion model whose parameters $M$ encode those of the original motion models $\{M_i\}_{i=1}^n$.

2. Multibody Motion Estimation: Estimate the parameters $M$ of the multibody motion model from the given image measurements. For the motion models considered in this paper, the parameters $M$ will correspond to the coefficients of a real or complex polynomial $p_n$ of degree $n$. We will show that the number of motions $n$ and the coefficients $M$ can be obtained linearly after properly embedding the image measurements into a higher-dimensional space.

3. Motion Segmentation: Recover the parameters of the original motion models from the parameters of the multibody motion model $M$, that is,

$$M \rightarrow \{M_i\}_{i=1}^n. \quad (2)$$


We will show that the individual motion parameters $M_i$ can be computed from the derivatives of $p_n$ evaluated at a collection of $n$ image measurements, which can be obtained automatically from the data.

This new approach to motion segmentation offers two important technical advantages over previously known algebraic solutions to the segmentation of 3-D translational motions [39] and rigid-body motions (fundamental matrices) [40] based on homogeneous polynomial factorization:

1. It is based on polynomial differentiation rather than polynomial factorization, which greatly improves the efficiency, accuracy, and robustness of the algorithm.

2. It applies to either feature point correspondences or optical flow and includes most of the two-view motion models in computer vision: 2-D translational, similarity, and affine motions, as well as 3-D translational motions, rigid-body motions (fundamental matrices), and motions of planar scenes (homographies), as shown in Table I. The unification is achieved by embedding some of the motion models into the complex domain, which resolves cases, such as 2-D affine motions and 3-D homographies, that could not be solved in the real domain.

With respect to extant probabilistic methods, our approach has the advantage of providing a global, non-iterative solution that does not need initialization. Therefore, our method can be used to initialize any iterative or optimization-based technique, such as EM, or else in a layered (multiscale) or hierarchical fashion, at the user's discretion. Although the derivation of the algorithm will assume noise-free data, the algorithm is designed to work with a moderate level of noise, as we will point out shortly. In its present form, however, the algorithm does not consider the presence of outliers in the data.

2. Segmenting Hyperplanes in $\mathbb{C}^K$

As we will see shortly, most 2-D and 3-D motion segmentation problems are equivalent to, or can be reduced to, clustering data lying on multiple hyperplanes in $\mathbb{R}^3$, $\mathbb{C}^2$, $\mathbb{C}^3$, or $\mathbb{C}^4$. Rather than solving this problem for each particular case, we present in this section a unified solution to the common mathematical problem of segmenting hyperplanes in $\mathbb{C}^K$ for arbitrary $K$, by adapting the Generalized PCA algorithm of [38] to the complex domain.

To this end, let $z$ be a vector in $\mathbb{C}^K$ and let $z^T$ be its transpose without conjugation.¹ A homogeneous polynomial of degree $n$ in $z$ is a polynomial

¹ For simplicity, we will not follow the standard definition of the Hermitian transpose, which involves conjugating the entries of $z$.


$p_n(z)$ such that $p_n(\lambda z) = \lambda^n p_n(z)$ for all $\lambda \in \mathbb{C}$. The space of all homogeneous polynomials of degree $n$ in $K$ variables, $S_n$, is a vector space of dimension $M_n(K) \doteq \binom{n+K-1}{K-1} = \binom{n+K-1}{n}$. A particular basis for $S_n$ is obtained by considering all the monomials of degree $n$ in $K$ variables, that is, $z^I \doteq z_1^{n_1} z_2^{n_2} \cdots z_K^{n_K}$ with $0 \le n_j \le n$ for $j = 1, \ldots, K$, and $n_1 + n_2 + \cdots + n_K = n$. To represent $S_n$, it is convenient to define the Veronese map $\nu_n : \mathbb{C}^K \to \mathbb{C}^{M_n(K)}$ of degree $n$ as [11]

$$\nu_n : [z_1, \ldots, z_K]^T \mapsto [\ldots, z^I, \ldots]^T$$

with the index $I$ chosen in the degree-lexicographic order. The Veronese map is also known as the polynomial embedding in the machine learning community. Using this notation, each polynomial $p_n(z) \in S_n$ can be written as a linear combination of the monomials $z^I$ as

$$p_n(z) = c^T \nu_n(z) = \sum c_{n_1, n_2, \ldots, n_K}\, z_1^{n_1} z_2^{n_2} \cdots z_K^{n_K}, \quad (3)$$

where $c \in \mathbb{C}^{M_n(K)}$ is the vector of coefficients.

Assume now that we are given a set of sample points $Z \doteq \{z^j \in \mathbb{C}^K\}_{j=1}^N$ drawn from $n \ge 1$ different hyperplanes $\{P_i \subseteq \mathbb{C}^K\}_{i=1}^n$ of dimension $K-1$. Without knowing which point belongs to which hyperplane, we would like to determine the number of hyperplanes, a basis for each hyperplane, and the segmentation of the data points.

Notice that every $(K-1)$-dimensional hyperplane $P_i \subset \mathbb{C}^K$ can be represented by its normal vector $b_i \in \mathbb{C}^K$ as

$$P_i \doteq \{z \in \mathbb{C}^K : b_i^T z = b_{i1} z_1 + b_{i2} z_2 + \cdots + b_{iK} z_K = 0\}. \quad (4)$$

We assume that the hyperplanes are different from each other, and hence the normal vectors $\{b_i\}$ are pairwise linearly independent. For uniqueness, we also assume that either the norm or the last entry of each $b_i$ is 1.

2.1. ELIMINATING SEGMENTATION

We first notice that each point $z \in Z$, regardless of which hyperplane of $\{P_i\}_{i=1}^n$ it is associated with, must satisfy the following homogeneous polynomial of degree $n$ in $K$ complex variables:

$$p_n(z) \doteq \prod_{i=1}^n (b_i^T z) = c^T \nu_n(z) = \sum c_{n_1, n_2, \ldots, n_K}\, z_1^{n_1} z_2^{n_2} \cdots z_K^{n_K} = 0, \quad (5)$$

because we must have $b_i^T z = 0$ for one of the $b_i$. In the context of motion segmentation, the vectors $b_i$ represent the parameters of each individual motion model, and the coefficient vector $c \in \mathbb{C}^{M_n(K)}$ represents the multibody motion parameters. We call $n$ the degree of the multibody motion model.


2.2. ESTIMATING THE DEGREE AND PARAMETERS OF THE MULTIBODY MOTION MODEL

Since the polynomial $p_n(z) = c^T \nu_n(z)$ must be satisfied by all the data points $Z = \{z^j \in \mathbb{C}^K\}_{j=1}^N$, we have that $c^T \nu_n(z^j) = 0$ for all $j = 1, \ldots, N$. Thus, we obtain the following linear system on $c$:

$$L_n c = 0 \in \mathbb{C}^N, \quad (6)$$

where $L_n \doteq [\nu_n(z^1), \nu_n(z^2), \ldots, \nu_n(z^N)]^T \in \mathbb{C}^{N \times M_n(K)}$. However, so far we have assumed that we know the number of hyperplanes $n$. In practice, this is not always the case. In general, we have the following statement:

LEMMA 1. Given a sufficient number of sample points in general position on $n$ hyperplanes in $\mathbb{C}^K$, we have

$$n = \arg\min_i \{\, \operatorname{rank}(L_i) = M_i(K) - 1 \,\}, \quad (7)$$

and the equation $L_n c = 0$ determines the vector $c$ up to a nonzero scale.

Proof. The proof of this lemma can be found in [34] and is a consequence of the basic algebraic fact that there is a one-to-one correspondence between a polynomial and its zero set. Therefore, there is no polynomial of degree $i < n$ that vanishes on all the points of the $n$ hyperplanes, hence we must have $\operatorname{rank}(L_i) = M_i(K)$ for $i < n$. Conversely, there are multiple polynomials of degree $i > n$, namely any multiple of $p_n(z)$, which are satisfied by all the data, hence $\operatorname{rank}(L_i) < M_i(K) - 1$ for $i > n$. Therefore, the case $i = n$ is the only one in which the system $L_n c = 0$ has a unique solution (up to scale).

According to this lemma, the number of hyperplanes (or motions) can be uniquely determined as the smallest $i$ such that $L_i$ drops rank. Furthermore, if the last entry of each $b_i$ is equal to one, so is the last entry of $c$, hence one can solve for $c$ uniquely in this case.

In the presence of noise, we cannot directly estimate $n$ and subsequently $c$ from (7), because the matrix $L_i$ may be full rank for all $i \ge 1$. Borrowing tools from the model selection literature [18], for noisy data we may determine the number of hyperplanes (or motions) as

$$n = \arg\min_i \left\{ \frac{\sigma^2_{M_i(K)}(L_i)}{\sum_{k=1}^{M_i(K)-1} \sigma^2_k(L_i)} + \kappa\, M_i(K) \right\}, \quad (8)$$

where $\sigma_k(L_i)$ is the $k$th singular value of $L_i$ and $\kappa > 0$ is a (weighting) parameter. Once $n$ is determined, we can solve for $c$ in a least-squares sense as the singular vector of $L_n$ associated with its smallest singular value. One can normalize $c$ so that its last entry is 1, whenever appropriate.
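Lemma 1 is easy to illustrate numerically on a small synthetic example of our own (noise-free, so the exact rank test applies): points drawn from $n = 2$ hyperplanes in $\mathbb{R}^3$ make $L_1$ full rank, while $L_2$ drops rank by exactly one.

```python
from itertools import combinations_with_replacement

import numpy as np

rng = np.random.default_rng(0)

def veronese(z, n):
    return np.array([np.prod(z[list(idx)])
                     for idx in combinations_with_replacement(range(len(z)), n)])

# noise-free points on two hyperplanes in R^3 with unit normals b1, b2
b1 = np.array([1.0, 0.0, 1.0]) / np.sqrt(2)
b2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)

def sample(b, m):
    # project random vectors onto the hyperplane {z : b^T z = 0} (||b|| = 1)
    Z = rng.standard_normal((m, 3))
    return Z - np.outer(Z @ b, b)

Z = np.vstack([sample(b1, 20), sample(b2, 20)])

for i in (1, 2):
    Li = np.array([veronese(z, i) for z in Z])
    print(i, np.linalg.matrix_rank(Li, tol=1e-9), Li.shape[1])
# degree 1: rank 3 = M_1(3)      (no linear form vanishes on both planes)
# degree 2: rank 5 = M_2(3) - 1  (only p_2(z) = (b1^T z)(b2^T z) vanishes)
# so the smallest degree at which L_i drops rank is n = 2
```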


2.3. SEGMENTING THE MULTIBODY MOTION

Given $c$, we now present an algorithm for computing the parameters $b_i$ of each individual hyperplane (or motion) from the derivatives of $p_n$. To this end, we consider the derivative of $p_n(z)$,

$$Dp_n(z) \doteq \frac{\partial p_n(z)}{\partial z} = \sum_{i=1}^n \prod_{\ell \ne i} (b_\ell^T z)\, b_i, \quad (9)$$

and notice that if we evaluate $Dp_n(z)$ at a point $z = y_i$ that corresponds to the $i$th hyperplane (or motion), i.e. if $y_i$ is such that $b_i^T y_i = 0$, then we have $Dp_n(y_i) \sim b_i$. Therefore, given $c$ we can obtain the hyperplane (motion) parameters as

$$b_i = \left.\frac{Dp_n(z)}{e_K^T Dp_n(z)}\right|_{z=y_i} \quad \text{or} \quad b_i = \left.\frac{Dp_n(z)}{\|Dp_n(z)\|}\right|_{z=y_i}, \quad (10)$$

depending on whether $e_K^T b_i = 1$ or $\|b_i\| = 1$, where $e_K = [0, \ldots, 0, 1]^T \in \mathbb{C}^K$ and $y_i \in \mathbb{C}^K$ is a nonzero vector such that $b_i^T y_i = 0$.

The rest of the problem is to find one point $y_i \in \mathbb{C}^K$ in each one of the hyperplanes $P_i = \{z \in \mathbb{C}^K : b_i^T z = 0\}$ for $i = 1, \ldots, n$. To this end, notice that we can always choose a point $y_n$ lying on one of the hyperplanes as any of the points in the data set $Z$. However, in the presence of noise and outliers, an arbitrary point in $Z$ may be far from the hyperplanes. The question is then how to compute the distance from each data point to its closest hyperplane, without knowing the normals to the hyperplanes. The following lemma allows us to compute a first-order approximation to such a distance:

LEMMA 2. Let $\tilde{z} \in P_i$ be the projection of a point $z \in \mathbb{C}^K$ onto its closest hyperplane $P_i$. Also let $\Pi \doteq [I_{K-1}\ 0] \in \mathbb{R}^{(K-1) \times K}$ or $I_K \in \mathbb{R}^{K \times K}$, depending on whether $e_K^T b_i = 1$ or $\|b_i\| = 1$ for $i = 1, \ldots, n$, respectively. Then the Euclidean distance from $z$ to $P_i$ is given by

$$\|z - \tilde{z}\| = \frac{|p_n(z)|}{\|\Pi\, Dp_n(z)\|} + O(\|z - \tilde{z}\|^2). \quad (11)$$

Proof. See [34].
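The first-order distance formula in Lemma 2 can be checked on a toy polynomial. Below, $p_2(z) = (b_1^T z)(b_2^T z)$ with unit normals (so $\Pi = I$), and the ratio $|p_n(z)|/\|Dp_n(z)\|$ reproduces a small offset $\varepsilon$ from one plane up to $O(\varepsilon^2)$. The specific normals and test point are made up for illustration:

```python
import numpy as np

# two unit normals in R^3; with ||b_i|| = 1, Lemma 2 uses Pi = I
b1 = np.array([1.0, 0.0, 1.0]) / np.sqrt(2)
b2 = np.array([0.0, 1.0, 0.0])

p  = lambda z: (b1 @ z) * (b2 @ z)            # p_2(z) = (b1^T z)(b2^T z)
dp = lambda z: (b2 @ z) * b1 + (b1 @ z) * b2  # Dp_2 by the product rule, cf. (9)

z0 = np.array([1.0, 2.0, -1.0])   # lies on plane 1: b1^T z0 = 0
eps = 1e-3
z = z0 + eps * b1                 # push it off the plane; true distance = eps

approx = abs(p(z)) / np.linalg.norm(dp(z))
assert abs(approx - eps) < 10 * eps**2   # first-order accurate, as (11) predicts
```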

Therefore, we can choose a point in the data set close to one of the subspaces as

$$y_n = \arg\min_{z \in Z} \frac{|p_n(z)|}{\|\Pi\, Dp_n(z)\|}, \quad (12)$$

and then compute the normal vector at $y_n$ as $b_n = Dp_n(y_n)/(e_K^T Dp_n(y_n))$ or $b_n = Dp_n(y_n)/\|Dp_n(y_n)\|$. In order to find a point $y_{n-1}$ in one of the remaining hyperplanes, we could simply remove the points on $P_n$ from $Z$ and compute $y_{n-1}$ as in (12), but minimizing over $Z \setminus P_n$, and so on. However, this process is not very robust in the presence of noise. Therefore, we propose an alternative solution that penalizes choosing a point from $P_n$ in (12) by dividing the objective function by the distance from $z$ to $P_n$, namely $|b_n^T z|/\|\Pi b_n\|$. That is, we can choose a point on or close to $\cup_{i=1}^{n-1} P_i$ as

$$y_{n-1} = \arg\min_{z \in Z} \frac{\dfrac{|p_n(z)|}{\|\Pi\, Dp_n(z)\|} + \delta}{\dfrac{|b_n^T z|}{\|\Pi b_n\|} + \delta}, \quad (13)$$

where $\delta > 0$ is a small positive number chosen to avoid cases in which both the numerator and the denominator are zero (e.g., with perfect data). By repeating this process for the remaining hyperplanes, we obtain the following Polynomial Differentiation Algorithm (Algorithm 1) for segmenting hyperplanes in $\mathbb{C}^K$.

Algorithm 1 (PDA: Segmenting Multiple Hyperplanes in $\mathbb{C}^K$)

Given data points $Z = \{z^j \in \mathbb{C}^K\}_{j=1}^N$:

− construct the embedded data matrix
  $$L_i = [\nu_i(z^1), \nu_i(z^2), \ldots, \nu_i(z^N)]^T \in \mathbb{C}^{N \times M_i(K)}, \quad i = 1, 2, \ldots;$$

− determine the number of hyperplanes $n$ from
  $$n = \arg\min_i \left\{ \frac{\sigma^2_{M_i(K)}(L_i)}{\sum_{k=1}^{M_i(K)-1} \sigma^2_k(L_i)} + \kappa\, M_i(K) \right\};$$

− solve for $c \in \mathbb{C}^{M_n(K)}$ from the linear system $L_n c = 0$;

− set $p_n(z) = c^T \nu_n(z)$;

− for $i = n : 1$,
  $$y_i = \arg\min_{z \in Z} \frac{\dfrac{|p_n(z)|}{\|\Pi\, Dp_n(z)\|} + \delta}{\dfrac{|b_{i+1}^T z| \cdots |b_n^T z|}{\|\Pi b_{i+1}\| \cdots \|\Pi b_n\|} + \delta}; \qquad b_i = \frac{Dp_n(y_i)}{e_K^T Dp_n(y_i)};$$

− end.
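To make the steps concrete, here is a self-contained numerical sketch of Algorithm 1 for the unit-norm variant ($\|b_i\| = 1$, $\Pi = I$) on noise-free synthetic data in $\mathbb{R}^3$, with $n$ fixed at 2 for brevity. The derivative $Dp_n$ is taken by central differences rather than analytically, and the names and data are our own illustration, not the paper's code:

```python
from itertools import combinations_with_replacement

import numpy as np

rng = np.random.default_rng(1)
K, n, delta = 3, 2, 1e-6

def veronese(z, d):
    return np.array([np.prod(z[list(idx)])
                     for idx in combinations_with_replacement(range(K), d)])

# --- synthetic noise-free data on two hyperplanes in R^3 ---
B_true = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]])
B_true /= np.linalg.norm(B_true, axis=1, keepdims=True)
def sample(b, m):
    Z = rng.standard_normal((m, K))
    return Z - np.outer(Z @ b, b)          # project onto {z : b^T z = 0}
Z = np.vstack([sample(B_true[0], 30), sample(B_true[1], 30)])

# --- multibody model: c spans the null space of L_n ---
Ln = np.array([veronese(z, n) for z in Z])
c = np.linalg.svd(Ln)[2][-1]               # singular vector of smallest sigma
p = lambda z: c @ veronese(z, n)           # p_n(z) = c^T nu_n(z)

def dp(z, h=1e-6):                         # Dp_n(z) by central differences
    g = np.zeros(K)
    for k in range(K):
        e = np.zeros(K); e[k] = h
        g[k] = (p(z + e) - p(z - e)) / (2 * h)
    return g

# --- peel off one normal vector per pass, as in the for-loop of Algorithm 1 ---
B = []
for _ in range(n):
    # numerator of (13): first-order distance to the union of hyperplanes
    score = np.array([abs(p(z)) / (np.linalg.norm(dp(z)) + 1e-12) for z in Z])
    # denominator of (13): distance to the hyperplanes already recovered
    seen = np.array([np.prod([abs(b @ z) for b in B]) for z in Z])
    y = Z[np.argmin((score + delta) / (seen + delta))]
    g = dp(y)
    B.append(g / np.linalg.norm(g))        # unit-norm variant of (10)

# each recovered normal matches one of the true normals up to sign
for b in B:
    assert np.max(np.abs(B_true @ b)) > 1 - 1e-3
```

With noisy data one would additionally use (8) to pick $n$ and keep $\delta$ comparable to the noise level; this sketch only exercises the noise-free mechanics.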


Notice that one could also choose the points $y_i$ in a purely algebraic fashion, e.g., by intersecting a random line with the hyperplanes, or else by dividing the polynomial $p_n(z)$ by $b_n^T z$. However, we have chosen to present the simpler Algorithm 1 instead, because it performs better with noisy data and is not very sensitive to the choice of $\delta$.

3. 2-D Motion Segmentation by Segmenting Hyperplanes in $\mathbb{C}^K$

This section considers the problem of segmenting a collection of 2-D motion models from point correspondences in two frames of a video sequence, or from optical flow measurements at each pixel. We show that for 2-D translational, similarity, and affine motion models, the motion segmentation problem is equivalent to segmenting hyperplanes in $\mathbb{C}^2$, $\mathbb{C}^3$, or $\mathbb{C}^4$, respectively.

3.1. SEGMENTATION OF 2-D TRANSLATIONAL MOTIONS: SEGMENTING HYPERPLANES IN $\mathbb{C}^2$

3.1.1. The Case of Feature Points
Under the 2-D translational motion model, the two images are related by one out of $n$ possible 2-D translations $\{T_i \in \mathbb{R}^2\}_{i=1}^n$. That is, for each feature pair $x_1 \in \mathbb{R}^2$ and $x_2 \in \mathbb{R}^2$ there exists a 2-D translation $T_i \in \mathbb{R}^2$ such that

$$x_2 = x_1 + T_i. \quad (14)$$

Therefore, if we interpret the displacement of the features $(x_2 - x_1)$ and the 2-D translations $T_i$ as complex numbers $(x_2 - x_1) \in \mathbb{C}$ and $T_i \in \mathbb{C}$, then we can rewrite equation (14) as

$$b_i^T z \doteq \begin{bmatrix} T_i & 1 \end{bmatrix} \begin{bmatrix} 1 \\ -(x_2 - x_1) \end{bmatrix} = 0 \in \mathbb{C}^2. \quad (15)$$

The above equation corresponds to a hyperplane in $\mathbb{C}^2$ whose normal vector $b_i$ encodes the 2-D translational motion $T_i$. Therefore, the segmentation of 2-D translational motions from a set of point correspondences $\{(x_1^j, x_2^j)\}_{j=1}^N$ is equivalent to clustering data points $\{z^j \in \mathbb{C}^2\}_{j=1}^N$ lying on $n$ hyperplanes in $\mathbb{C}^2$ with normal vectors $\{b_i \in \mathbb{C}^2\}_{i=1}^n$. As such, we can obtain the motion parameters $\{b_i \in \mathbb{C}^2\}_{i=1}^n$ by applying Algorithm 1 with $K = 2$ and $\Pi = [1\ 0]$ to a collection of $N \ge M_n(2) - 1 = n$ image measurements $\{z^j \in \mathbb{C}^2\}_{j=1}^N$ in general position on the $n$ hyperplanes. The original real motion parameters are then given by

$$T_i = [\operatorname{Re}(b_{i1}), \operatorname{Im}(b_{i1})]^T, \quad \text{for } i = 1, \ldots, n. \quad (16)$$
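In this $\mathbb{C}^2$ case the multibody polynomial is univariate in the complex displacement $d = x_2 - x_1$, so once $c$ is fitted the translations are simply its roots. A sketch with two made-up translations and mild noise (the data and names are illustrative, not from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(2)

# two ground-truth 2-D translations, written as complex numbers
T = np.array([1.0 + 2.0j, -3.0 + 1.0j])

# complex displacements d_j = x2 - x1: each feature moves by one of the T_i
d = np.concatenate([np.full(25, T[0]), np.full(25, T[1])])
d += 0.001 * (rng.standard_normal(50) + 1j * rng.standard_normal(50))

# z = [1, -d]^T in C^2, so nu_2(z) = [1, -d, d^2] and L_2 c = 0, cf. (6)
L = np.stack([np.ones_like(d), -d, d**2], axis=1)
c = np.linalg.svd(L)[2][-1].conj()       # null vector: L c ~ 0

# p_2 as a polynomial in d is c0 - c1 d + c2 d^2; its roots are the T_i
T_est = sorted(np.roots([c[2], -c[1], c[0]]), key=lambda t: t.real)
print(T_est)        # close to [-3+1j, 1+2j]
```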


3.1.2. The Case of Translational Optical Flow
Imagine now that rather than a collection of feature points we are given the optical flow $\{u^j \in \mathbb{R}^2\}_{j=1}^N$ between two consecutive views of a video sequence. If we assume that the optical flow is piecewise constant, i.e. that the optical flow of every pixel in the image takes only $n$ possible values $\{T_i \in \mathbb{R}^2\}_{i=1}^n$, then at each pixel $j \in \{1, \ldots, N\}$ there exists a motion $T_i$ such that

$$u^j = T_i. \quad (17)$$

The problem is now to estimate the $n$ motion models $\{T_i\}_{i=1}^n$ from the optical flow $\{u^j\}_{j=1}^N$. This problem can be solved with the same technique as in the case of feature points after replacing $x_2 - x_1 = u$.

3.2. SEGMENTATION OF 2-D SIMILARITY MOTIONS: SEGMENTING HYPERPLANES IN $\mathbb{C}^3$

3.2.1. The Case of Feature Points
In this case, we assume that for each feature pair $(x_1, x_2)$ there exists a 2-D rigid-body motion $(R_i, T_i) \in SE(2)$ and a scale $\lambda_i \in \mathbb{R}^+$ such that

$$x_2 = \lambda_i R_i x_1 + T_i = \lambda_i \begin{bmatrix} \cos(\theta_i) & -\sin(\theta_i) \\ \sin(\theta_i) & \cos(\theta_i) \end{bmatrix} x_1 + T_i. \quad (18)$$

Therefore, if we interpret the rotation matrix as a unitary complex number $R_i = \exp(\theta_i \sqrt{-1}) \in \mathbb{C}$, and the translation vector and the image features as points in the complex plane, $T_i, x_1, x_2 \in \mathbb{C}$, then we can write the 2-D similarity motion model as the following hyperplane in $\mathbb{C}^3$:

$$b_i^T z \doteq \begin{bmatrix} \lambda_i R_i & T_i & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ 1 \\ -x_2 \end{bmatrix} = 0. \quad (19)$$

Therefore, the segmentation of 2-D similarity motions is equivalent to segmenting hyperplanes in $\mathbb{C}^3$. As such, we can apply Algorithm 1 with $K = 3$ and $\Pi = [I_2\ 0]$ to a collection of $N \ge M_n(3) - 1 \sim O(n^2)$ image measurements $\{z^j \in \mathbb{C}^3\}_{j=1}^N$ in general position on the hyperplanes to obtain the motion parameters $\{b_i \in \mathbb{C}^3\}_{i=1}^n$. The original real motion parameters are then given by

$$\lambda_i = |b_{i1}|, \quad \theta_i = \angle b_{i1}, \quad T_i = [\operatorname{Re}(b_{i2}), \operatorname{Im}(b_{i2})]^T, \quad \text{for } i = 1, \ldots, n. \quad (20)$$
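A quick round trip of the encoding in (19) and (20), with made-up numbers: build $b_i$ from $(\lambda, \theta, T)$, check the constraint on an actual correspondence, and decode the parameters back.

```python
import numpy as np

lam, theta = 1.5, 0.3
T = np.array([2.0, -1.0])

# b_i = [lam * R_i, T_i, 1] with R_i = exp(i*theta) and T_i as a complex number
b = np.array([lam * np.exp(1j * theta), T[0] + 1j * T[1], 1.0])

# the hyperplane constraint (19) holds on an actual correspondence
x1 = 0.7 + 0.2j
x2 = lam * np.exp(1j * theta) * x1 + (T[0] + 1j * T[1])
z = np.array([x1, 1.0, -x2])
assert abs(b @ z) < 1e-12     # b^T z = 0; note: transpose without conjugation

# decode via (20)
assert np.isclose(abs(b[0]), lam) and np.isclose(np.angle(b[0]), theta)
assert np.allclose([b[1].real, b[1].imag], T)
```

Note that `b @ z` on complex NumPy arrays sums plain products without conjugation, which matches the paper's non-Hermitian transpose convention.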


3.2.2. The Case of Optical Flow
Let $\{u^j \in \mathbb{R}^2\}_{j=1}^N$ be $N$ measurements of the optical flow at the $N$ pixels $\{x^j \in \mathbb{R}^2\}_{j=1}^N$. We assume that the optical flow can be modeled as a collection of $n$ 2-D similarity motion models of the form $u = \lambda_i R_i x + T_i$. Therefore, the segmentation of 2-D similarity motions from optical flow measurements can be solved as in the case of feature points, after replacing $x_2 = u$ and $x_1 = x$.

3.3. SEGMENTATION OF 2-D AFFINE MOTIONS: SEGMENTING

HYPERPLANES INC4

3.3.1. The Case of Feature PointsIn this case, we assume that the images are related by a collection ofn 2-Daffine motion models{Ai ∈ R2×3}n

i=1. That is, for each feature pair(x1, x2)there exist a 2-D affine motionAi such that

x2 = Ai

[x1

1

]=

[a11 a12 a13

a21 a22 a23

]

i

[x1

1

]. (21)

Therefore, if we interpret x_2 as a complex number x_2 ∈ C, but we still think of x_1 as a vector in R^2, then we have

x_2 = a_i^T \begin{bmatrix} x_1 \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} + a_{21}\sqrt{-1}, & a_{12} + a_{22}\sqrt{-1}, & a_{13} + a_{23}\sqrt{-1} \end{bmatrix}_i \begin{bmatrix} x_1 \\ 1 \end{bmatrix}. \qquad (22)

The above equation represents the following hyperplane in C^4:

b_i^T z = \begin{bmatrix} a_i^T & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ 1 \\ -x_2 \end{bmatrix} = 0, \qquad (23)

where the normal vector b_i ∈ C^4 encodes the affine motion parameters and the data point z ∈ C^4 encodes the image measurements x_1 ∈ R^2 and x_2 ∈ C. Therefore, the segmentation of 2-D affine motion models is equivalent to segmenting hyperplanes in C^4. As such, we can apply Algorithm 1 with K = 4 and Π = [I_3 0] to a collection of N ≥ M_n(4) − 1 ∼ O(n^3) image measurements {z_j ∈ C^4}_{j=1}^N in general position to obtain the motion parameters {b_i ∈ C^4}_{i=1}^n. The original affine motion models are then obtained as

A_i = \begin{bmatrix} \operatorname{Re}(b_{i1}) & \operatorname{Re}(b_{i2}) & \operatorname{Re}(b_{i3}) \\ \operatorname{Im}(b_{i1}) & \operatorname{Im}(b_{i2}) & \operatorname{Im}(b_{i3}) \end{bmatrix} ∈ R^{2×3}, \quad i = 1, \ldots, n. \qquad (24)
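A minimal numpy sketch of the complexification in equations (22)–(24) (helper names are hypothetical; we assume the recovered normal b has been rescaled so that its last entry equals one, i.e. b = [a; 1]):

```python
import numpy as np

def affine_to_normal(A):
    """Complexify a real affine motion A in R^{2x3} into the hyperplane
    normal b = [a, 1] in C^4 of equations (22)-(23)."""
    a = A[0] + 1j * A[1]
    return np.append(a, 1.0)

def normal_to_affine(b):
    """Equation (24): recover A from b (assumed scaled so that b[3] == 1)."""
    return np.vstack([b[:3].real, b[:3].imag])

A = np.array([[1.0, 0.2, 3.0],
              [-0.5, 0.9, 1.0]])
b = affine_to_normal(A)

# The data point z = [x1, 1, -x2] of (23) satisfies b^T z = 0.
x1 = np.array([2.0, -1.0])
x2 = complex(*(A @ np.append(x1, 1.0)))   # x2 viewed as a complex number
z = np.array([x1[0], x1[1], 1.0, -x2])
residual = abs(b @ z)                      # vanishes up to round-off
```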


3.3.2. The Case of Affine Optical Flow
In this case, the optical flow u is modeled as being generated by a collection of n affine motion models {A_i ∈ R^{2×3}}_{i=1}^n of the form

u = A_i \begin{bmatrix} x \\ 1 \end{bmatrix}. \qquad (25)

Therefore, the segmentation of 2-D affine motions can be solved as in the case of feature points, after replacing x_2 = u and x_1 = x.

4. 3-D Motion Segmentation

This section considers the problem of segmenting a collection of 3-D motion models from measurements of either the position of a set of feature points in two frames of a video sequence, or optical flow measurements at each pixel. We show that for 3-D translational motions, fundamental matrices and homographies, the motion segmentation problem is equivalent to segmenting hyperplanes in R^3, quadratic surfaces in R^{3×3} and quadratic surfaces in C^{2×3}, respectively.

4.1. SEGMENTATION OF 3-D TRANSLATIONAL MOTIONS: SEGMENTING HYPERPLANES IN R^3

4.1.1. The Case of Feature Points
In this case, we assume that the scene can be modeled as a mixture of purely translational motion models {T_i ∈ R^3}_{i=1}^n, where T_i represents the translation (calibrated case) or the epipole (uncalibrated case) of object i relative to the camera between the two frames. A solution to this problem based on polynomial factorization was proposed in [39]. Here we present a much simpler solution based on polynomial differentiation.

Given the images x_1 ∈ P^2 and x_2 ∈ P^2 of a point in object i in the first and second frame, they must satisfy the well-known epipolar constraint for linear motions

-x_2^T [T_i]_\times x_1 = T_i^T (x_2 \times x_1) = T_i^T \ell = 0, \qquad (26)

where \ell = (x_2 \times x_1) ∈ R^3 is known as the epipolar line associated with the image pair (x_1, x_2). Therefore, the segmentation of 3-D translational motions is equivalent to clustering data (epipolar lines) lying on a collection of hyperplanes in R^3 whose normal vectors are the n epipoles {T_i}_{i=1}^n. As such, we can apply Algorithm 1 with K = 3 to N ≥ M_n(3) − 1 ∼ O(n^2) epipolar lines {\ell_j = x_1^j \times x_2^j}_{j=1}^N in general position to estimate the epipoles {T_i}_{i=1}^n from the derivatives of the polynomial p_n(\ell) = (T_1^T \ell) \cdots (T_n^T \ell). The only difference is that in this case the last entry of each epipole is not constrained to be equal to one. Therefore, when choosing the points y_i in Algorithm 1 we should take Π = I_K so as not to eliminate the last coordinate. Furthermore, we compute the epipoles up to an unknown scale factor as

T_i = Dp_n(y_i)/\|Dp_n(y_i)\|, \quad i = 1, \ldots, n, \qquad (27)

where the unknown scale is lost under perspective projection.

Alternatively, one can evaluate the epipole associated with each epipolar line {\ell_j}_{j=1}^N as Dp_n(\ell_j) and then apply any clustering algorithm that deals with projective data to the points {Dp_n(\ell_j)}_{j=1}^N to obtain the n epipoles. For example, one can apply spectral clustering using the absolute value of the angle between Dp_n(\ell_j) and Dp_n(\ell_{j'}) as a pairwise distance between image pairs j and j'.
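For the simplest case of n = 2 translational motions, the whole pipeline — lift the epipolar lines with the Veronese map, fit p_n from the null space of the lifted data, and read the epipoles off the gradient — fits in a short numpy sketch (synthetic data; helper names are hypothetical):

```python
import numpy as np

def veronese2(l):
    """Degree-2 Veronese map of l in R^3: all monomials of degree 2."""
    x, y, z = l
    return np.array([x*x, x*y, x*z, y*y, y*z, z*z])

def dveronese2(l):
    """6x3 Jacobian of veronese2 at l (used for the chain rule below)."""
    x, y, z = l
    return np.array([[2*x, 0, 0], [y, x, 0], [z, 0, x],
                     [0, 2*y, 0], [0, z, y], [0, 0, 2*z]])

rng = np.random.default_rng(0)
T1 = np.array([1.0, 0.0, 0.5]);  T1 /= np.linalg.norm(T1)
T2 = np.array([0.0, 1.0, -0.3]); T2 /= np.linalg.norm(T2)

# Synthetic epipolar lines: vectors orthogonal to T1 or T2,
# i.e. points on the two hyperplanes T_i^T ell = 0.
lines = []
for T in (T1, T2):
    B = np.linalg.svd(T[None, :])[2][1:]   # basis of the plane normal to T
    lines += [B.T @ rng.standard_normal(2) for _ in range(20)]
lines = np.asarray(lines)

# The coefficients c of p2(ell) = (T1^T ell)(T2^T ell) span the null space
# of the Veronese-lifted data matrix.
L = np.array([veronese2(l) for l in lines])
c = np.linalg.svd(L)[2][-1]

# The gradient of p2 at a line on plane i is proportional to the epipole T_i.
def epipole_at(l):
    e = dveronese2(l).T @ c                # chain rule: Dp2(l) = Dnu2(l)^T c
    return e / np.linalg.norm(e)

e1 = epipole_at(lines[0])                  # a line from the first plane
e2 = epipole_at(lines[20])                 # a line from the second plane
```

Up to sign, e1 and e2 recover T1 and T2. With noisy lines one would keep exactly this smallest-singular-vector fit and then cluster the per-line gradient directions, as described in the text.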

4.1.2. The Case of Optical Flow
In the case of optical flow generated by purely translating objects we have u^T [T_i]_\times x = 0, where u is augmented to a three-dimensional vector as u = [u, v, 0]^T ∈ R^3. Thus, one can estimate the translations {T_i ∈ R^3}_{i=1}^n as before by replacing x_2 = u and x_1 = x.

4.2. SEGMENTATION OF 3-D RIGID-BODY MOTIONS: SEGMENTING QUADRATIC FORMS IN R^{3×3}

In this section, we consider the problem of segmenting multiple 3-D rigid-body motions from two perspective views. The geometry of the problem was studied in [40], which introduced the multibody epipolar constraint and its associated fundamental matrix, together with a generalization of the 8-point algorithm based on the factorization of bi-homogeneous polynomials. Here we present a much simpler solution based on taking derivatives of the multibody epipolar constraint, thus avoiding polynomial factorization.

Assume that the motion of the objects relative to the camera between the two views can be modeled as a mixture of 3-D rigid-body motions {(R_i, T_i) ∈ SE(3)}_{i=1}^n, each represented by a nonzero rank-2 fundamental matrix F_i. Given an image pair (x_1, x_2), there exists a motion i such that the following epipolar constraint is satisfied:

x_2^T F_i x_1 = 0. \qquad (28)

Therefore, the following multibody epipolar constraint [40] must be satisfied by the number of independent motions n, the fundamental matrices {F_i}_{i=1}^n and the image pair (x_1, x_2), regardless of the object to which the image pair belongs:

p_n(x_1, x_2) \doteq \prod_{i=1}^{n} \left( x_2^T F_i x_1 \right) = 0. \qquad (29)


The multibody epipolar constraint can be written in bilinear form as

\nu_n(x_2)^T F \nu_n(x_1) = 0, \qquad (30)

where F ∈ R^{M_n(3)×M_n(3)} is the so-called multibody fundamental matrix [40]. When the number of motions n is known, one can linearly estimate F from N ≥ M_n(3)^2 − 1 ∼ O(n^4) image pairs in general position by solving the linear system L_n f = 0, where f ∈ R^{M_n(3)^2} is the stack of the rows of F and L_n ∈ R^{N×M_n(3)^2} is a matrix whose jth row is (\nu_n(x_2^j) \otimes \nu_n(x_1^j))^T, with ⊗ the Kronecker product. When the number of motions is unknown, one can estimate it as

n = \min\{ i : \operatorname{rank}(L_i) = M_i(3)^2 − 1 \}, \qquad (31)

where L_i is computed with the Veronese map of degree i.

We now show how to estimate the fundamental matrices {F_i}_{i=1}^n from the multibody fundamental matrix F by taking derivatives of the multibody epipolar constraint. Recall that, given a point x_1 ∈ P^2 in the first image frame, the epipolar lines associated with it are defined as \ell_i \doteq F_i x_1 ∈ R^3, i = 1, ..., n. Therefore, if the image pair (x_1, x_2) corresponds to motion i, i.e. if x_2^T F_i x_1 = 0, then

\frac{\partial}{\partial x_2}\, \nu_n(x_2)^T F \nu_n(x_1) = \sum_{k=1}^{n} \prod_{\ell \neq k} (x_2^T F_\ell x_1)(F_k x_1) = \prod_{\ell \neq i} (x_2^T F_\ell x_1)(F_i x_1) \sim \ell_i. \qquad (32)

In other words, the partial derivative of the multibody epipolar constraint with respect to x_2, evaluated at (x_1, x_2), is proportional to the epipolar line associated with (x_1, x_2) in the second view. Similarly, the partial derivative of the multibody epipolar constraint with respect to x_1, evaluated at (x_1, x_2), is proportional to the epipolar line associated with (x_1, x_2) in the first view. Therefore, given a set of image pairs {(x_1^j, x_2^j)}_{j=1}^N and the multibody fundamental matrix F ∈ R^{M_n(3)×M_n(3)}, we can estimate a collection of epipolar lines {\ell_j}_{j=1}^N associated with each image pair.^2 As described in Section 4.1, this collection of epipolar lines must pass through the n epipoles {T_i}_{i=1}^n. Therefore, we can apply Algorithm 1 with K = 3 and Π = I_3 to the epipolar lines {\ell_j}_{j=1}^N to obtain the n epipoles {T_i}_{i=1}^n up to a scale factor, as in equation (27). If the n epipoles are different,^3 then we can immediately compute the n fundamental matrices {F_i}_{i=1}^n by assigning the image pair (x_1^j, x_2^j) to group i if i = \arg\min_{\ell=1,\ldots,n} (T_\ell^T \ell_j)^2, and then applying the eight-point algorithm to the image pairs in group i = 1, ..., n.

^2 Remember from Section 4.1 that in the case of purely translating objects the epipolar lines were readily obtained as x_1 × x_2. Here the calculation is more involved because of the rotational component of the rigid-body motions.

^3 Notice that this is not a strong assumption. If two individual fundamental matrices share the same (left) epipoles, one can consider the right epipoles (in the first image frame) instead, because it is extremely rare that two motions give rise to the same left and right epipoles. In fact, this happens only when the rotation axes of the two motions are equal to each other and parallel to the translation direction [40].
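The model-selection rule (31) is easy to prototype. The sketch below (hypothetical helper names, noise-free synthetic data) builds a generic Veronese map with itertools and searches for the smallest degree at which the embedded data matrix drops rank by exactly one:

```python
import numpy as np
from itertools import combinations_with_replacement

def veronese(x, n):
    """Veronese map nu_n(x): all degree-n monomials of x; length M_n(K)."""
    idx = combinations_with_replacement(range(len(x)), n)
    return np.array([np.prod(x[list(i)]) for i in idx])

def estimate_num_motions(X1, X2, max_n=4, tol=1e-8):
    """Equation (31): smallest degree i at which L_i has rank M_i(3)^2 - 1."""
    for i in range(1, max_n + 1):
        L = np.array([np.kron(veronese(x2, i), veronese(x1, i))
                      for x1, x2 in zip(X1, X2)])
        s = np.linalg.svd(L, compute_uv=False)
        if (s > tol * s[0]).sum() == L.shape[1] - 1:  # rank deficiency of one
            return i
    return None
```

On noise-free correspondences drawn from two rigid-body motions this returns 2; with noisy data one would soften the hard rank test, e.g. by thresholding a singular-value ratio.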

4.3. SEGMENTATION OF 3-D HOMOGRAPHIES: SEGMENTING QUADRATIC FORMS IN C^{2×3}

The motion segmentation scheme described in the previous section assumes that the displacement of each object between the two views relative to the camera is nonzero, i.e. T_i ≠ 0; otherwise the individual fundamental matrices are zero. Furthermore, it also requires that the 3-D points be in general configuration, otherwise one cannot uniquely recover each fundamental matrix from its epipolar constraint. The latter case occurs, for example, in the case of a planar structure, i.e. when the 3-D points lie on a plane [13]. Nevertheless, both in the case of a purely rotating object (with respect to the camera center) and in the case of a planar 3-D structure, the motion model between the two views x_1 ∈ P^2 and x_2 ∈ P^2 is described by a homography matrix H ∈ R^{3×3} such that [13]

x_2 \sim H x_1 \doteq \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} x_1. \qquad (33)

Consider the case in which we are given a set of image pairs {(x_1^j, x_2^j)}_{j=1}^N that can be modeled with n independent homographies {H_i}_{i=1}^n (the precise notion of independence will become clear later in Definition 1). Note that the n homographies do not necessarily correspond to n different rigid-body motions. This is because it could be the case that one rigidly moving object consists of two or more planes, hence its rigid-body motion will lead to two or more homographies. Therefore, the n homographies can represent anything from 1 up to n rigid-body motions. In either case, it is evident from the form of equation (33) that we cannot take the product of all the equations, as we did with the epipolar constraints, because we have two linearly independent equations per image pair. We resolve this difficulty by working in the complex domain, as we describe below.

4.3.1. Complexification of Homographies
We interpret the second image x_2 ∈ P^2 as a point in CP^1 by considering the first two coordinates of x_2 as a complex number and appending a one to it. However, we still think of x_1 as a point in P^2. With this interpretation, we can rewrite (33) as

x_2 \sim H^{1,2} x_1 \doteq \begin{bmatrix} h_{11}+h_{21}\sqrt{-1} & h_{12}+h_{22}\sqrt{-1} & h_{13}+h_{23}\sqrt{-1} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} x_1, \qquad (34)

where H^{1,2} ∈ C^{2×3} now represents a complex homography^4 obtained by complexifying the first two rows of H (as indicated by the superscripts). Let w_2 be the vector in CP^1 perpendicular to x_2, i.e., if x_2 = [z, 1]^T then w_2 = [1, -z]^T. Then we can rewrite (34) as the following complex bilinear constraint

w_2^T H^{1,2} x_1 = 0, \qquad (35)

^4 Strictly speaking, we embed each real homography matrix into an affine complex matrix.

which we call the complex homography constraint.

We can therefore interpret the motion segmentation problem as one in which we are given image data {x_1^j ∈ P^2}_{j=1}^N and {w_2^j ∈ CP^1}_{j=1}^N related by a collection of n complex homographies {H_i^{1,2} ∈ C^{2×3}}_{i=1}^n. Then each image pair (x_1, w_2) has to satisfy the multibody homography constraint

\prod_{i=1}^{n} \left( w_2^T H_i^{1,2} x_1 \right) = \nu_n(w_2)^T H^{1,2} \nu_n(x_1) = 0, \qquad (36)

regardless of which one of the n complex homographies is associated with the image pair. We call the matrix H^{1,2} ∈ C^{M_n(2)×M_n(3)} the multibody homography. Now, since the multibody homography constraint (36) is linear in the multibody homography H^{1,2}, when n is known we can linearly solve for H^{1,2} from (36) given N ≥ M_n(2)M_n(3) − (M_n(3) + 1)/2 ∼ O(n^3) image pairs in general position.^5 When the number of motions is unknown, one can estimate it as

n = \min\{ i : \operatorname{rank}(L_i) = M_i(2)M_i(3) − 1 \}, \qquad (37)

where L_i ∈ C^{N×M_i(2)M_i(3)} is a matrix whose jth row is (\nu_i(w_2^j) \otimes \nu_i(x_1^j))^T.
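The complexification (34)–(35) is easy to verify numerically. The sketch below (generic random homography, hypothetical variable names) checks that the complex homography constraint vanishes on a correspondence generated by a real homography:

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((3, 3))            # a generic real homography
x1 = np.append(rng.standard_normal(2), 1.0)

y = H @ x1                                 # x2 ~ H x1 in P^2
z = (y[0] + 1j * y[1]) / y[2]              # first two coordinates of x2 as one complex number

H12 = np.vstack([H[0] + 1j * H[1], H[2]])  # complex homography H^{1,2} in C^{2x3}
w2 = np.array([1.0, -z])                   # w2 "perpendicular" to x2 = [z, 1]^T

residual = abs(w2 @ H12 @ x1)              # complex homography constraint (35)
```

Note that the product w_2^T H^{1,2} x_1 is bilinear (no complex conjugation), which is exactly what numpy's `@` computes on complex arrays.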

4.3.2. Independent Homographies
Given the multibody homography H^{1,2} ∈ C^{M_n(2)×M_n(3)}, the rest of the problem is to recover the individual homographies {H_i^{1,2}}_{i=1}^n or {H_i}_{i=1}^n. In the case of fundamental matrices discussed in Section 4.2, the key to solving the problem was the fact that fundamental matrices are of rank 2, hence one can cluster epipolar lines based on the epipoles. Note that here we cannot do the same with real homographies H_i ∈ R^{3×3}, because in general they are full rank. Nevertheless, if we work with the complex homographies H_i^{1,2} ∈ C^{2×3} instead, they automatically have rank 2, and we call the only vector in the kernel of a complex homography H^{1,2} its complex epipole, denoted by e^{1,2} ∈ C^3. That is, we have H^{1,2} e^{1,2} = 0.

As in the case of multiple fundamental matrices, which normally are assumed to have different epipoles, we also wish our complex homographies to have different complex epipoles. Notice that, from the definition of complex epipole, different real homographies H_i ∈ R^{3×3} may result in the same epipole. In general, one can show that the set of complex homographies that share the same epipole e^{1,2} is a five-dimensional subset (hence a zero-measure subset) of all real homography matrices. We then want to know under what conditions the epipoles are guaranteed to be different. The following lemma gives such a condition.

^5 The multibody homography constraint gives two equations per image pair, and there are (M_n(2) − 1)M_n(3) complex entries in H^{1,2} and M_n(3) real entries (the last row).

LEMMA 3. If the third rows of two real non-degenerate homography matrices H_1 and H_2 ∈ R^{3×3} are different (in a projective sense),^6 then the associated complex epipoles e_1^{1,2} and e_2^{1,2} ∈ C^3 must be different (in a projective sense).

Proof. Let H_1 and H_2 ∈ R^{3×3} be two different homographies, and let h_1^T and h_2^T ∈ R^3 be their respective third rows. Suppose that their complex epipole is the same, say e. The complexifications of the first two rows of H_1 and H_2 are orthogonal to e, so they must lie in the (complex) plane spanned by h_1^T and h_2^T. Thus, all three rows of H_1 or H_2 are linearly dependent on h_1^T and h_2^T. This contradicts that H_1 and H_2 are non-degenerate.

The following two examples provide more insight into the geometry of complex epipoles.

Example 1 [Complex Epipole of a Rotational Homography]. Suppose that a homography H is induced by a rotation, i.e., H = R ∈ SO(3). Let r_1^T, r_2^T, r_3^T ∈ R^3 be its three rows. The complexification gives the two row vectors r_1^T + \sqrt{-1}\, r_2^T and r_3^T. It is easy to check that the epipole is e = r_1 + \sqrt{-1}\, r_2, which is orthogonal to both row vectors. We also see that Lemma 3 is only sufficient but not necessary: rotations in the XY-plane share the same last row [0, 0, 1], but in general they lead to different epipoles.

Example 2 [One-Motion/Multi-Planes v.s. Multi-Motions/One-Plane]. A homography is generally of the form H = R + T\pi^T, where (R, T) is the camera motion and \pi is the plane normal. If the homographies come from different planes (different \pi) undergoing the same rigid-body motion with T_z ≠ 0, then the associated complex epipoles will always be different, since the third rows depend on \pi_i^T. However, if one plane with normal vector \pi = [0, 0, 1]^T undergoes different translational motions of the form T_i = [T_{ix}, T_{iy}, T_{iz}]^T, then all the complex epipoles are equal to e_i^{1,2} = [\sqrt{-1}, -1, 0]^T. To avoid this problem, one can complexify the first and third rows of H instead of the first two. The new complex epipoles will be e_i^{1,3} = [T_{ix} + T_{iz}\sqrt{-1}, T_{iy}, -1]^T, which in general are different for different translational motions.
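Example 1 can be checked numerically: for a random rotation R, the kernel of the complexification H^{1,2} = [r_1^T + \sqrt{-1} r_2^T; r_3^T] contains r_1 + \sqrt{-1} r_2 (a small sketch with hypothetical names; the "orthogonality" here is the bilinear product e^T e, not the Hermitian one):

```python
import numpy as np

rng = np.random.default_rng(2)
Q = np.linalg.qr(rng.standard_normal((3, 3)))[0]
R = Q if np.linalg.det(Q) > 0 else -Q      # a random rotation in SO(3)

H12 = np.vstack([R[0] + 1j * R[1], R[2]])  # complexified rotational homography
e = R[0] + 1j * R[1]                       # claimed complex epipole e = r1 + i r2

residual = np.linalg.norm(H12 @ e)         # e lies in the kernel of H12
```

The first entry of H12 @ e is (r_1 + i r_2)·(r_1 + i r_2) = |r_1|^2 − |r_2|^2 + 2i r_1·r_2 = 0 because the rows of a rotation are orthonormal; the second is r_3·(r_1 + i r_2) = 0.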

^6 By "in a projective sense" we mean that two vectors or matrices are the same if they differ only by a non-zero scalar.


The above observation can be further generalized as follows. Let H^{1,2}, H^{2,3}, and H^{1,3} ∈ C^{2×3} be the three different complex homographies associated with a real homography matrix H ∈ R^{3×3}, obtained by complexifying rows (1, 2), rows (2, 3), and rows (1, 3), respectively. Let e^{1,2}, e^{2,3}, e^{1,3} ∈ C^3 be the three corresponding epipoles.

THEOREM 1 (Complex Epipoles of Real Homographies). Any two real and non-degenerate homography matrices H_1 and H_2 ∈ R^{3×3} are different (in a projective sense) if and only if their three complex epipoles (e^{1,2}, e^{2,3}, e^{1,3}) are different (as a set).

Proof. The sufficiency is obvious from the definition of the complex epipoles, so we only have to show the necessity. Assume that all the epipoles are the same. According to Lemma 3, each of the three rows of H_1 and H_2 must then be equal up to a (possibly different) scale. That is, H_2 = DH_1 for some diagonal matrix D \doteq \operatorname{diag}\{d_1, d_2, d_3\} ∈ R^{3×3}. Let h_1^T, h_2^T, h_3^T be the three rows of H_1. If d_1 ≠ d_2, then either the two vectors (d_1 h_1^T + \sqrt{-1}\, d_2 h_2^T, h_3^T) span a different plane in C^3 from that spanned by (h_1^T + \sqrt{-1}\, h_2^T, h_3^T), or we have

d_1 h_1^T + \sqrt{-1}\, d_2 h_2^T = \alpha (h_1^T + \sqrt{-1}\, h_2^T) + \beta h_3^T

for some \alpha, \beta not identically zero. The latter gives \alpha = d_2 and (d_1 − d_2) h_1^T = \beta h_3^T, which contradicts that the matrix H_1 is non-degenerate. Thus, if d_1 ≠ d_2, the two epipoles e_1^{1,2} and e_2^{1,2} must be different. Therefore, for all the epipoles to be the same, we must have d_1 = d_2 = d_3. That is, H_1 and H_2 differ only by a non-zero scalar and are the same matrix in a projective sense.

The above results make the following definition meaningful.

DEFINITION 1 (Independent Homographies). A set of real non-degenerate homography matrices {H_i} is said to be independent if the sets of complex epipoles {e_i^{1,2}, e_i^{2,3}, e_i^{1,3}} of their complexifications {H_i^{1,2}, H_i^{2,3}, H_i^{1,3}} are different (in a projective sense) from each other.

Theorem 1 implies that a set of real homographies are independent as long as they are different.

4.3.3. Decomposition of the Multibody Homography
For simplicity, let us first consider the simple case in which the homographies have different complex epipoles for one of the complexifications, say H^{1,2}. Once the multibody homography matrix H^{1,2} is obtained, similar to the case of the epipolar lines of fundamental matrices (32), we can associate a complex epipolar line

\ell^j \sim \left. \frac{\partial\, \nu_n(w_2)^T H^{1,2} \nu_n(x_1)}{\partial x_1} \right|_{(x_1, w_2) = (x_1^j, w_2^j)} \in CP^2 \qquad (38)

with each image pair (x_1^j, w_2^j). Given this set of N ≥ M_n(3) − 1 complex epipolar lines {\ell^j}_{j=1}^N in general position, we can apply Algorithm 1 with K = 3 and Π = I_3 to estimate the n complex epipoles {e_i^{1,2} ∈ C^3}_{i=1}^n up to a scale factor, similar to the case with pure translations studied in Section 4.1. Therefore, since the n complex epipoles are different, we can cluster the original image measurements by assigning image pair (x_1^j, x_2^j) to group i if i = \arg\min_{\ell=1,\ldots,n} |e_\ell^T \ell^j|^2. Once the image pairs have been clustered, the estimation of each homography, either real or complex, becomes a simple linear problem.
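The grouping step is a one-liner once the complex epipoles are known (hypothetical helper; E stacks the recovered epipoles row-wise and ell stacks one complex epipolar line per image pair; the product e^T ell is bilinear, without conjugation):

```python
import numpy as np

def assign_pairs(E, ell):
    """Assign image pair j to group argmin_l |e_l^T ell_j|^2, with E (n x 3)
    the complex epipoles and ell (N x 3) the complex epipolar lines."""
    return np.argmin(np.abs(ell @ E.T) ** 2, axis=1)
```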

REMARK 1 (Direct Extraction of Homographies from H^{1,2}). There is yet another way to obtain the individual H_i from H^{1,2} without segmenting the image pairs first. Once the complex epipoles e_i are known, one can compute the following linear combination of the rows of H_i^{1,2} (up to scale) from the derivatives of the multibody homography constraint at e_i:

w^T H_i^{1,2} \sim \left. \frac{\partial\, \nu_n(w)^T H^{1,2} \nu_n(x)}{\partial x} \right|_{x = e_i} \in CP^2, \quad \forall w \in C^2. \qquad (39)

In particular, if we take w = [1, 0]^T and w = [0, 1]^T we obtain the first and second rows of H_i^{1,2} (up to scale), respectively. By choosing additional w's one obtains more linear combinations from which the rows of H_i can be linearly and uniquely determined.

Let us now consider the case in which the homographies are different, but none of the complexifications gives n different complex epipoles. In this case, we can apply the above process to each one of the three complexifications, thus obtaining three possible groupings of the image measurements. The number of groups is strictly less than n for each one of the three groupings. In the case of perfect data, the correct grouping into n motions can be obtained by assigning two image pairs to the same motion if and only if they belong to the same group for each one of the three groupings. In the case of noisy image measurements, the above process will result in oversegmentation and needs to be followed by further postprocessing.


5. Experiments on Real and Synthetic Images

In this section we evaluate our motion segmentation algorithm on both real and synthetic imagery. We compare our results with those of existing motion segmentation algorithms and use our algorithm to initialize iterative methods.

5.1. 2-D TRANSLATIONAL MOTIONS

We first test our polynomial differentiation algorithm (PDA) on a 12-frame video sequence consisting of an aerial view of two robots moving on the ground. The robots are purposely moving slowly, so that it is harder to distinguish their optical flow from the noise. At each frame, we apply Algorithm 1 with K = 2, Π = [1 0], κ = 10^{-6} and δ = 0.02 to the optical flow of all N = 240 × 352 pixels in the image. We compute the optical flow using Black's code, which is available at http://www.cs.brown.edu/people/black/ignc.html. The leftmost column of Figure 1 displays the x and y coordinates of the optical flow for frames 4 and 10, showing that it is not so simple to distinguish the three clusters from the raw data. The remaining columns of Figure 1 show the segmentation of the image pixels into three 2-D translational motion models. The motion of the two robots and that of the background are correctly segmented.

[Figure: scatter plots of the optical flow (x component vs. y component) for frames 4 and 10, together with the resulting segmentations.]

Figure 1. Segmenting the optical flow of the two-robot sequence by clustering lines in C^2.

We also test our algorithm on two outdoor sequences taken by a moving camera tracking a car moving in front of a parking lot and a building (sequences A and B), and one indoor sequence taken by a moving camera tracking a person moving his head (sequence C), as shown in Figure 2. The data for these sequences are taken from [20] and consist of point correspondences in multiple views, which are available at http://www.suri.it.okayama-u.ac.jp/data.html. For each pair of consecutive frames we apply Algorithm 1 with K = 2, Π = [1 0] and δ = 0.02 to the point correspondences. For all sequences and for every pair of frames the number of motions is correctly estimated as n = 2 for all values of κ ∈ [2, 20] · 10^{-7}. For sequence A, our algorithm gives a perfect segmentation for all pairs of frames. For sequence B, our algorithm gives a perfect segmentation for all pairs of frames, except for 2 frames in which one point is misclassified. The average percentage of correct classification over the 17 frames is 99.8%. For sequence C, however, our algorithm has poor performance during the first and last 20 frames. This is because for these frames the camera and head motions are strongly correlated, and the interframe motion is just a few pixels. Therefore, it is very difficult to tell the motions apart from local information. However, if we combine all pairwise segmentations into a single segmentation,^7 our algorithm gives a percentage of correct classification of 100.0% for all three sequences, as shown in Table II. Table II also shows results reported in [20] for existing multiframe algorithms for motion segmentation. The comparison is somewhat unfair, because our algorithm uses only two views at a time and a simple 2-D translational motion model, while the other algorithms use multiple frames and a rigid-body motion model for affine cameras. Furthermore, our algorithm is purely algebraic, while the others use iterative refinement to deal with noise. Nevertheless, the only algorithm with performance comparable to ours is Kanatani's multi-stage optimization algorithm, which is based on solving a series of EM-like iterative optimization problems, at the expense of a significant increase in computation.

Table II. A comparison of the percentage of correct classification given by our two-view algebraic algorithm (PDA) with respect to that of extant multiframe optimization-based algorithms for sequences A, B, C in [20].

Sequence                               A        B        C
Number of points                       136      63       73
Number of frames                       30       17       100
Costeira-Kanade                        60.3%    71.3%    58.8%
Ichimura                               92.6%    80.1%    68.3%
Kanatani: subspace separation          59.3%    99.5%    98.9%
Kanatani: affine subspace separation   81.8%    99.7%    67.5%
Kanatani: multi-stage optimization     100.0%   100.0%   100.0%
PDA: mean over consecutive frames      100.0%   99.8%    86.4%
PDA: including all frames              100.0%   100.0%   100.0%

^7 A simple, though not necessarily optimal, way of combining multiple segmentations into a single one is to compute for each segmentation the probability that a point belongs to each one of the motions, and then use Bayes rule to combine those probabilities into a single one.


[Figure: for each sequence, the tracked frames and a curve of the percentage of correct classification (50–100%) vs. frame index.]

Figure 2. Segmenting the point correspondences of sequences A (left), B (center) and C (right) in [20] for each pair of consecutive frames by clustering lines in C^2. First row: first frame of the sequence with point correspondences superimposed. Second row: last frame of the sequence with point correspondences superimposed. Third row: displacement of the correspondences between first and last frames. Fourth row: percentage of correct classification for each pair of consecutive frames.

5.2. 3-D TRANSLATIONAL MOTIONS

In this section we compare our polynomial differentiation algorithm (PDA) with the polynomial factorization algorithm (PFA) of [39] and a variation of the Expectation Maximization algorithm (EM) for segmenting hyperplanes in R^3. For an image size of 500 × 500 pixels, we generate synthetic point correspondences undergoing multiple 3-D translational motions. The correspondences are then corrupted with zero-mean Gaussian noise with a standard deviation between 0 and 1 pixels. Figures 3(a) and (b) show the performance of all the algorithms as a function of the level of noise for n = 2 moving objects. The performance measures are the mean error between the estimated and the true epipoles (in degrees), and the mean percentage of correctly segmented correspondences, using 1000 trials for each level of noise. Notice that PDA gives a translation error of less than 1.3° and a percentage of correct classification of over 96%. Therefore, PDA reduces the translation error to approximately 1/3 and improves the classification performance by about 2% with respect to PFA. Notice also that EM with the epipoles initialized at random yields a nonzero error in the noise-free case, because it frequently converges to a local minimum. In fact, our algorithm PDA outperforms EM when a single random initialization for EM is used. However, if we use PDA to initialize EM (PDA+EM), the performance of both EM and PDA improves, showing that our algorithm can be effectively used to initialize iterative approaches to motion segmentation. Furthermore, the number of iterations of PDA+EM is approximately 50% of that of randomly initialized EM, hence there is also a gain in computing time. We also evaluate the performance of PDA as a function of the number of moving objects for different levels of noise, as shown in Figures 3(c) and (d). As expected, the performance deteriorates with the number of moving objects, though the translation error is still below 8° and the percentage of correct classification is over 78%.

[Figure: four panels plotted against the noise level in pixels — (a) translation error for n = 2, (b) % of correct classification for n = 2 (curves for PFA, PDA, EM, PDA+EM); (c) translation error for n = 1, ..., 4, (d) % of correct classification for n = 1, ..., 4.]

Figure 3. Segmenting 3-D translational motions by clustering planes in R^3. Top: comparing our algorithm with PFA and EM as a function of noise in the image features. Bottom: performance of PDA as a function of the number of motions for different levels of noise.


We also test the performance of PDA on a 320 × 240 video sequence containing a truck and a car undergoing two 3-D translational motions, as shown in Figure 4(a). We apply Algorithm 1 with K = 3, Π = I_3 and δ = 0.02 to the (real) epipolar lines obtained from a total of N = 92 point correspondences, 44 in the truck and 48 in the car. The point correspondences are obtained using the algorithm in [5]. The number of motions is correctly estimated as n = 2 for all κ ∈ [4, 90] · 10^{-4}. Notice that PDA gives a perfect segmentation of the correspondences, as shown in Figure 4(b). The two epipoles are estimated with an error of 5.9° for the truck and 1.7° for the car.

[Figure: (a) first frame of the sequence; (b) assignment of the 92 features to the truck and car groups.]

Figure 4. Segmenting two 3-D translational motions by clustering planes in R^3. (a) First frame. (b) Feature segmentation.

5.3. 3-D HOMOGRAPHIES

In this section we test the performance of our algorithm for segmenting rigid-body motions of planar 3-D structures, as described in Section 4.3. Figure 5(a) shows the first frame of a 2048 × 1536 video sequence with two moving objects: a cube and a checkerboard. Notice that although there are only two rigid-body motions, the scene contains three different homographies, one associated with each of the three visible planar structures. Furthermore, notice that the top side of the cube and the checkerboard have approximately the same normals. We manually tracked a total of N = 147 features: 98 in the cube (49 in each of the two visible sides) and 49 in the checkerboard. We applied our algorithm in Section 4.3 to segment the image data and obtained 97% correct classification, as shown in Figure 5(b). We then added zero-mean Gaussian noise with standard deviation between 0 and 1 pixels to the features, after rectifying the features in the second view in order to simulate the noise-free case. Figure 5(c) shows the mean percentage of correct classification for 1000 trials per level of noise. The percentage of correct classification of our algorithm is between 80% and 100%, which gives a very good initial estimate for any of the existing iterative, optimization, or EM-based motion segmentation schemes.

ijcv05-motion.tex; 20/09/2004; 3:12; p.25
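The noise protocol of this experiment can be sketched as follows. As a simple stand-in for the multibody algorithm of Section 4.3, each correspondence is assigned to whichever of two known homographies yields the smaller transfer error; the homographies H1 and H2, the image size, and the trial counts are all hypothetical illustration values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical homographies (the true ones in the experiment are
# induced by the cube and checkerboard planes and are not given here).
H1 = np.array([[1.0, 0.02, 5.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
H2 = np.array([[1.0, 0.0, 0.0], [0.03, 1.0, -4.0], [0.0, 0.0, 1.0]])

def apply_h(H, x):
    # Transfer homogeneous points x to inhomogeneous pixels in view 2.
    p = x @ H.T
    return p[:, :2] / p[:, 2:3]

N = 49
x1 = np.hstack([rng.uniform(0, 100, (2 * N, 2)), np.ones((2 * N, 1))])
x2 = np.vstack([apply_h(H1, x1[:N]), apply_h(H2, x1[N:])])
true_labels = np.repeat([0, 1], N)

def classify(x1, x2, sigma):
    # Corrupt view-2 features with pixel noise, then pick the
    # homography with the smaller transfer error per point.
    noisy = x2 + rng.normal(0.0, sigma, x2.shape)
    e1 = np.linalg.norm(apply_h(H1, x1) - noisy, axis=1)
    e2 = np.linalg.norm(apply_h(H2, x1) - noisy, axis=1)
    return (e2 < e1).astype(int)

for sigma in (0.0, 0.5, 1.0):
    rates = [np.mean(classify(x1, x2, sigma) == true_labels)
             for _ in range(100)]
    print(f"sigma={sigma}: {100 * np.mean(rates):.1f}% correct")
```

Averaging the classification rate over many trials per noise level mirrors how the curve in Figure 5(c) is produced, although the classifier here is only a per-model baseline.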

(a) First frame (b) Feature segmentation (c) % of correct classification

Figure 5. Segmenting 3-D homographies by clustering complex bilinear forms in C^{2×3}.

6. Conclusions

We have presented a unified algebraic approach to 2-D and 3-D motion segmentation from feature point correspondences or optical flow. Contrary to extant methods, our approach does not iterate between feature segmentation and motion estimation. Instead, it computes a single multibody motion model that is satisfied by all the image measurements and then extracts the original motion models from the derivatives of the multibody one. Experiments showed that our algorithm not only outperforms existing algebraic and factorization-based methods, but also provides a good initialization for iterative techniques, such as EM, which are strongly dependent on correct initialization.

References

1. S. Avidan and A. Shashua. Trajectory triangulation: 3D reconstruction of moving points from a monocular image sequence. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(4):348–357, 2000.

2. S. Ayer and H. Sawhney. Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding. In IEEE International Conference on Computer Vision, pages 777–785, 1995.

3. M. Black and P. Anandan. Robust dynamic motion estimation over time. In IEEE Conference on Computer Vision and Pattern Recognition, pages 296–302, 1991.

4. T.E. Boult and L.G. Brown. Factorization-based segmentation of motions. In Proc. of the IEEE Workshop on Motion Understanding, pages 179–186, 1991.

5. A. Chiuso, P. Favaro, H. Jin, and S. Soatto. Motion and structure causally integrated over time. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(4):523–535, 2002.


6. T. Darrel and A. Pentland. Robust estimation of a multi-layered motion representation. In IEEE Workshop on Visual Motion, pages 173–178, 1991.

7. X. Feng and P. Perona. Scene segmentation from 3D motion. In IEEE Conference on Computer Vision and Pattern Recognition, pages 225–231, 1998.

8. A. Fitzgibbon and A. Zisserman. Multibody structure and motion: 3D reconstruction of independently moving objects. In European Conference on Computer Vision, pages 891–906, 2000.

9. M. Han and T. Kanade. Reconstruction of a scene with multiple linearly moving objects. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 542–549, 2000.

10. M. Han and T. Kanade. Multiple motion scene reconstruction from uncalibrated views. In IEEE International Conference on Computer Vision, volume 1, pages 163–170, 2001.

11. J. Harris. Algebraic Geometry: A First Course. Springer-Verlag, 1992.

12. R. Hartley and R. Vidal. The multibody trifocal tensor: Motion segmentation from 3 perspective views. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.

13. R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge, 2000.

14. A. Heyden and K. Åström. Algebraic properties of multilinear constraints. Mathematical Methods in Applied Sciences, 20(13):1135–62, 1997.

15. M. Irani, B. Rousso, and S. Peleg. Detecting and tracking multiple moving objects using temporal integration. In European Conference on Computer Vision, pages 282–287, 1992.

16. A. Jepson and M. Black. Mixture models for optical flow computation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 760–761, 1993.

17. K. Kanatani. Motion segmentation by subspace separation and model selection. In IEEE International Conference on Computer Vision, volume 2, pages 586–591, 2001.

18. K. Kanatani. Evaluation and selection of models for motion segmentation. In Asian Conference on Computer Vision, pages 7–12, 2002.

19. K. Kanatani and C. Matsunaga. Estimating the number of independent motions for multibody motion segmentation. In European Conference on Computer Vision, pages 25–31, 2002.

20. K. Kanatani and Y. Sugaya. Multi-stage optimization for multi-body motion segmentation. In Australia-Japan Advanced Workshop on Computer Vision, pages 335–349, 2003.

21. Q. Ke and T. Kanade. A robust subspace approach to layer extraction. In IEEE Workshop on Motion and Video Computing, pages 37–43, 2002.

22. H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133–135, 1981.

23. Y. Ma, J. Kosecka, and S. Sastry. Optimization criteria and geometric algorithms for motion and structure estimation. International Journal of Computer Vision, 44(3):219–249, 2001.

24. Y. Ma, S. Soatto, J. Kosecka, and S. Sastry. An Invitation to 3D Vision: From Images to Geometric Models. Springer Verlag, 2003.

25. O. Shakernia, R. Vidal, and S. Sastry. Multi-body motion estimation and segmentation from multiple central panoramic views. In IEEE International Conference on Robotics and Automation, 2003.

26. A. Shashua and A. Levin. Multi-frame infinitesimal motion model for the reconstruction of (dynamic) scenes with multiple linearly moving objects. In IEEE International Conference on Computer Vision, volume 2, pages 592–599, 2001.


27. J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In IEEE International Conference on Computer Vision, pages 1154–1160, 1998.

28. S. Soatto and R. Brockett. Optimal and suboptimal structure from motion: local ambiguities and global estimates. In IEEE Conference on Computer Vision and Pattern Recognition, pages 282–288, 1998.

29. A. Spoerri and S. Ullman. The early detection of motion boundaries. In IEEE International Conference on Computer Vision, pages 209–218, 1987.

30. P. Sturm. Structure and motion for dynamic scenes - the case of points moving in planes. In European Conference on Computer Vision, pages 867–882, 2002.

31. R. Szeliski and S. B. Kang. Recovering 3D shape and motion from image streams using non-linear least squares. Journal of Visual Communication and Image Representation, 5(1):10–28, 1994.

32. P. Torr, R. Szeliski, and P. Anandan. An integrated Bayesian approach to layer extraction from image sequences. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(3):297–303, 2001.

33. P. H. S. Torr. Geometric motion segmentation and model selection. Phil. Trans. Royal Society of London, 356(1740):1321–1340, 1998.

34. R. Vidal. Generalized Principal Component Analysis (GPCA): an Algebraic Geometric Approach to Subspace Clustering and Motion Segmentation. PhD thesis, University of California, Berkeley, August 2003.

35. R. Vidal and R. Hartley. Motion segmentation with missing data by PowerFactorization and Generalized PCA. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.

36. R. Vidal and Y. Ma. A unified algebraic approach to 2-D and 3-D motion segmentation. In European Conference on Computer Vision, 2004.

37. R. Vidal, Y. Ma, S. Hsu, and S. Sastry. Optimal motion estimation from the multiview normalized epipolar constraint. In IEEE International Conference on Computer Vision, volume 1, pages 34–41, Vancouver, Canada, 2001.

38. R. Vidal, Y. Ma, and J. Piazzi. A new GPCA algorithm for clustering subspaces by fitting, differentiating and dividing polynomials. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.

39. R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 621–628, 2003.

40. R. Vidal, Y. Ma, S. Soatto, and S. Sastry. Two-view multibody structure from motion. International Journal of Computer Vision, 2004.

41. R. Vidal and S. Sastry. Segmentation of dynamic scenes from image intensities. In IEEE Workshop on Motion and Video Computing, pages 44–49, 2002.

42. R. Vidal and S. Sastry. Optimal segmentation of dynamic scenes from two perspective views. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 281–286, 2003.

43. J. Wang and E. Adelson. Layered representation for motion analysis. In IEEE Conference on Computer Vision and Pattern Recognition, pages 361–366, 1993.

44. Y. Weiss. A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models. In IEEE Conference on Computer Vision and Pattern Recognition, pages 321–326, 1996.

45. Y. Weiss. Smoothness in layers: Motion segmentation using nonparametric mixture estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 520–526, 1997.

46. J. Weng, N. Ahuja, and T. Huang. Optimal motion and structure estimation. IEEE Transactions PAMI, 9(2):137–154, 1993.


47. L. Wolf and A. Shashua. Affine 3-D reconstruction from two projective images of independently translating planes. In IEEE International Conference on Computer Vision, pages 238–244, 2001.

48. L. Wolf and A. Shashua. Two-body segmentation from two perspective views. In IEEE Conference on Computer Vision and Pattern Recognition, pages 263–270, 2001.

49. Y. Wu, Z. Zhang, T.S. Huang, and J.Y. Lin. Multibody grouping via orthogonal subspace decomposition. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 252–257, 2001.

50. L. Zelnik-Manor and M. Irani. Degeneracies, dependencies and their implications in multi-body and multi-sequence factorization. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 287–293, 2003.
