On the kinetic depth effect

Biol. Cybern. 60, 445-455 (1989) Biological Cybernetics © Springer-Verlag 1989

On the Kinetic Depth Effect

J. Aloimonos 1 and C. M. Brown 2

1 Computer Vision Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3411, USA
2 Department of Computer Science, University of Rochester, Rochester, NY 14627, USA

Abstract. The problem of the kinetic depth effect is revisited. We study how many points in how many views are necessary and sufficient to recover structure. The constraints in the cases where the velocities of the image points are known, and the positions of the image points are known with the correspondence between them established, are different and they have to be studied separately. In the case of two projections of any number of points there are infinitely many solutions, but if we regularize the problem we get a unique solution under some assumptions. Finally, an algorithm is discussed for learning this particular kind of regularization.

1 Introduction

The interpretation of visual motion by humans and other biological organisms is an exciting field in the study of perception. An issue here is what kinds of mathematical analysis are adequate and lead to biologically plausible models of computation for the task. The ability of the human visual system to discern structure from a motion stimulus was demonstrated by experiments by Wallach and O'Connell in the 1950s (Wallach and O'Connell 1953). Subsequently, Johansson (Johansson 1973) discovered our ability to recognize the human form from the projected motion of as few as ten points on the body, including various joints such as elbows, shoulders and knees.

It would seem that the perception of rigid structure from motion should not require the detection of the projected trajectories of too many points.

One of the first rigorous mathematical treatments of this problem was given by Ullman (Ullman 1979). In his classical paper on the computation of structure from motion, Ullman showed how structure was determined uniquely (up to a reflection) from the projected locations of four noncoplanar points, obtained at three different instants of time. His analysis is based on the orthographic projection model. The treatment also took the correspondence of the four projected points between the three frames as given. In our analysis, we too work with orthographic projection and assume the point correspondences already given. While it is true that the perspective or central projection model is more appropriate for image formation, we will argue that orthographic projection is a realistic simplification for this specific problem. One reason is that at small retinal eccentricities perspective effects are small. Another reason is that in Ullman's schemes, as well as ours, only a small number of points are considered at a time and so orthography will serve as an adequate model.

Ullman's analysis shows that three orthographic projections of four noncoplanar points are sufficient to recover structure. To our knowledge there has been no attempt to investigate lower bound results in relation to this problem. In other words, the question as to whether four points are necessary to recover structure from three views is yet to be answered. Also, what can be inferred from two views of any number of points is an interesting question.

The next section discusses previous work and the motivation for this research. Section 2 introduces the mathematical formulation and some lower bound results. Section 3 describes some mathematical preliminaries and the development of the constraints that will be used later. Section 4 continues with the establishment of the lower bounds, and Sect. 5 is devoted to the proof of the fact that two views with the assumption of smoothness uniquely define structure under some assumptions. Section 6 discusses learning of the algorithm introduced in Sect. 5. Finally, Sect. 7 describes relevant experiments, and the final section gives the summary and conclusions.


1.1 Previous Research and Motivation

The problem of structure from motion under orthographic projection has received a lot of attention. There have been two approaches towards its solution: the differential (small) motion and the discrete (large) motion approaches. In the differential case, the optic flow field (image velocity) is used for subsequent analysis, whereas in the discrete case the measurement of the retinal motion relies upon isolating feature points (tokens) and tracking them (correspondence) through time.

In the case of differential motion, where the velocities of image points are given, there is the work of Hoffman (1982), Sugihara and Sugie (1984), Sugie and Inagaki (1984), and Ullman (1983). Hoffman proved that optic flow cannot recover surface orientation, but if the temporal derivative of the flow is known (acceleration), the structure of the surface in view may be computed. It has to be understood that to know the temporal derivative of the flow field would require at least two optic flow fields. Sugihara and Sugie showed in a different way that optic flow cannot recover surface orientation, but two optic flow fields induced by the motion of the object in view admit a finite number of interpretations for the structure of the surface in view. Actually, in their experiments they only found two solutions for the structure (in fact, one solution plus its Necker reflection), even though their theoretical analysis showed that there should be more than two solutions. So, they developed a conjecture that stated that from two orthographically projected optic flow fields the structure is uniquely determined. In this paper we analyze the conjecture for the discrete case and we show why it is very probable that a unique solution is obtained. (There is more than one solution in some degenerate cases.) In other words we analyze this conjecture for the case of three distinct projections (instead of two consecutive optic flow fields). Sugie and Inagaki worked under the assumption of a fixed axis (the rotational velocity component is constant throughout the observation period). They recovered structure from two points and their velocities in three views, or three points and their velocities in two views. Finally, Ullman showed that optic flow can recover surface orientation only up to a scale factor, and there are two solutions for planar surfaces but one solution for nonplanar ones (solutions for the scaled version of the structure, not the structure itself, because structure cannot be computed from orthographically projected flow).

In the discrete case, there is only the work of Ullman (1979). Ullman dealt with the problem in the case where only the positions of the points are observable in the different frames, with the correspondence between the different points established, and he showed that four noncoplanar points in three views are sufficient to recover structure. To our knowledge there has been no attempt to determine whether four points in three views are necessary to determine structure uniquely. In this paper we investigate this question, and we develop lower bounds on the number of points and views that are necessary and sufficient to determine structure. We prove that four points in two views cannot determine structure, but if we require the surface in view to be smooth, then a unique solution may be obtained (regularization). Also, we present a learning method for obtaining the parameters needed in the regularization algorithm. We should mention in passing that the problem of interpretation of Johansson's "biological motion" was analyzed by Hoffman and Flinchbaugh (1982), Hoffman and Bennett (1985), and Bennett and Hoffman (1985). Their analysis is for orthographic projection with the additional assumption that the axis of rotation is fixed for the entire period of observation (i.e., the motion is planar). On the other hand, we consider rigid motion and so we do not require a fixed axis assumption. Also there is a great deal of research on this problem under the perspective projection model (see for example Koenderink and Van Doorn 1977; Ullman 1979; Tsai and Huang 1984; Longuet-Higgins and Prazdny 1984; Longuet-Higgins 1981; Waxman and Ullman 1985; Adiv 1984; Aloimonos 1986; Bandyopadhyay 1986), but here we address the problem under orthographic projection.

2 Mathematical Formulation and Lower Bound Arguments

Consider the Cartesian representation of a point in 3-D space. This is the vector $(X, Y, Z)$. A quartet of such points can be written as $(X_i, Y_i, Z_i)$, $i = 1, 2, 3, 4$. Let these points move and take up new positions $(X'_i, Y'_i, Z'_i)$. Considering rigidity, we have the fact that the motion can be represented by an affine transformation:

$$(X'_i, Y'_i, Z'_i)^T = R\,(X_i, Y_i, Z_i)^T + (\Delta X, \Delta Y, \Delta Z)^T, \tag{1}$$

where $R$ is a three by three rotation matrix and $(\Delta X, \Delta Y, \Delta Z)$ the translation vector. Taking the orthographic projection of the above we have

$$X'_i = r_{11} X_i + r_{12} Y_i + r_{13} Z_i + \Delta X, \tag{2}$$

$$Y'_i = r_{21} X_i + r_{22} Y_i + r_{23} Z_i + \Delta Y, \tag{3}$$

where the elements $r_{ij}$ of the rotation matrix depend upon three independent parameters: the axis of rotation and the angle of rotation about the axis. Now if we take two views of three points, we obtain six equations in seven unknowns: three for the rotation, two depth variables (we have three depths but only relative depth can be recovered) and two for the translation. Thus, we cannot solve the problem in this case. A similar argument holds for three views of two points and two views of four coplanar points. So, according to the above equation-counting argument, the following theorem has been proved.

Theorem 1. In general it is impossible to recover the structure of

(1) three points, given two distinct orthographic projections of these points;

(2) two points, given three distinct orthographic projections of these points; and

(3) four coplanar points, given two distinct orthographic projections of these points.
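The equation-counting argument behind cases (1) and (2) can be tallied mechanically. The sketch below is illustrative only: the function name `count_constraints` and the bookkeeping convention (the first view fixes the frame, each further view adds three rotation and two translation parameters, and only relative depths count) are assumptions made here, not the authors' notation. Case (3) relies on the extra dependence among coplanar points and is not covered by this simple tally.

```python
def count_constraints(n_points, n_views):
    """Return (equations, unknowns) for the counting argument of Sect. 2.

    Assumption: the first view fixes the coordinate frame, and every
    further view contributes one rotation (3 parameters) and one
    image-plane translation (2 parameters), as in Eqs. (2)-(3).
    """
    extra_views = n_views - 1
    # Each point yields one instance of Eq. (2) and one of Eq. (3) per extra view.
    equations = 2 * n_points * extra_views
    # Rotation plus translation parameters for every extra view.
    motion = (3 + 2) * extra_views
    # Only relative depth is recoverable, so one depth is dropped.
    depths = n_points - 1
    return equations, motion + depths


if __name__ == "__main__":
    for n_points, n_views in [(3, 2), (2, 3)]:
        eqs, unknowns = count_constraints(n_points, n_views)
        print(n_points, "points,", n_views, "views:", eqs, "equations,", unknowns, "unknowns")
```

A balanced count is not sufficient for a finite number of solutions: four points in two views give eight equations in eight unknowns under this tally, yet Theorem 2 shows that infinitely many interpretations remain.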

3 Mathematical Preliminaries

In this section we develop the constraints that were mentioned in the previous section in a series of lemmas.

Lemma 1. Given two distinct orthographic projections of three points in a rigid configuration, the gradient $(p, q)$ of the plane that the three points define (with respect to the coordinate system of the first frame) lies on a conic section in gradient space. The coefficients of this conic section depend entirely on the interframe displacements of the above points.

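Proof. Let the three points in space be $O$, $A$, $B$ in their first position and $O'$, $A'$, $B'$ in their second position, and let their projections in the two frames be $O_1$, $A_1$, $B_1$ and $O_2$, $A_2$, $B_2$, respectively. Let also the gradient of the plane $OAB$ be $G = (p, q)$. Furthermore, let

$$O_1A_1 = \boldsymbol{\alpha}_1 = (x_1, y_1), \tag{4}$$

$$O_1B_1 = \boldsymbol{\beta}_1 = (c_1, d_1), \tag{5}$$

$$O_2A_2 = \boldsymbol{\alpha}_2 = (x_2, y_2), \tag{6}$$

$$O_2B_2 = \boldsymbol{\beta}_2 = (c_2, d_2). \tag{7}$$

Considering the geometry of the first projection ($OAB$ to $O_1A_1B_1$) and the second projection ($O'A'B'$ to $O_2A_2B_2$), we have

$$OA = (x_1, y_1, G \cdot \boldsymbol{\alpha}_1), \tag{8}$$

$$OB = (c_1, d_1, G \cdot \boldsymbol{\beta}_1), \tag{9}$$

$$O'A' = (x_2, y_2, \lambda), \tag{10}$$

$$O'B' = (c_2, d_2, \mu), \tag{11}$$

where $\lambda$ and $\mu$ are to be determined. But because of rigid motion, we have $\|OA\| = \|O'A'\|$, $\|OB\| = \|O'B'\|$ and $OA \cdot OB = O'A' \cdot O'B'$, where $\|\cdot\|$ denotes length and "$\cdot$" the dot product operation. Combining the above constraints with Eqs. (8)-(11) and eliminating $\lambda$ and $\mu$, we get

$$\begin{aligned}
&(\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_2\|^2)(G \cdot \boldsymbol{\alpha}_1)^2 + (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_2\|^2)(G \cdot \boldsymbol{\beta}_1)^2 \\
&\quad - 2(\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_2 \cdot \boldsymbol{\beta}_2)(G \cdot \boldsymbol{\alpha}_1)(G \cdot \boldsymbol{\beta}_1) \\
&\quad + (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_2\|^2)(\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_2\|^2) - (\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_2 \cdot \boldsymbol{\beta}_2)^2 = 0.
\end{aligned} \tag{12}$$

Given that $G \cdot \boldsymbol{\alpha}_1 = p x_1 + q y_1$ and $G \cdot \boldsymbol{\beta}_1 = p c_1 + q d_1$, Eq. (12) is of the form

$$\alpha p^2 + \beta q^2 + \gamma pq + \delta = 0,$$

where the coefficients $\alpha$, $\beta$, $\gamma$, $\delta$ depend on the image vectors $\boldsymbol{\alpha}_1$, $\boldsymbol{\alpha}_2$, $\boldsymbol{\beta}_1$, and $\boldsymbol{\beta}_2$. (q.e.d.)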
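The elimination step that leads to Eq. (12) is short enough to spell out. In the sketch below the shorthand $a$, $b$, $c$, $z_1$, $z_2$ is introduced here for brevity and is not the authors' notation; $\boldsymbol{\alpha}_i$ and $\boldsymbol{\beta}_i$ denote the image vectors $O_iA_i$ and $O_iB_i$.

```latex
\begin{align*}
z_1 &= G\cdot\boldsymbol{\alpha}_1, \qquad z_2 = G\cdot\boldsymbol{\beta}_1,\\
a &= \|\boldsymbol{\alpha}_1\|^2-\|\boldsymbol{\alpha}_2\|^2, \qquad
b = \|\boldsymbol{\beta}_1\|^2-\|\boldsymbol{\beta}_2\|^2, \qquad
c = \boldsymbol{\alpha}_1\cdot\boldsymbol{\beta}_1-\boldsymbol{\alpha}_2\cdot\boldsymbol{\beta}_2,\\[4pt]
\|OA\|=\|O'A'\| \;&\Rightarrow\; \lambda^2 = a + z_1^2,\\
\|OB\|=\|O'B'\| \;&\Rightarrow\; \mu^2 = b + z_2^2,\\
OA\cdot OB = O'A'\cdot O'B' \;&\Rightarrow\; \lambda\mu = c + z_1 z_2,\\[4pt]
(\lambda\mu)^2=\lambda^2\mu^2 \;&\Rightarrow\; (c+z_1z_2)^2 = (a+z_1^2)(b+z_2^2)\\
&\Rightarrow\; b\,z_1^2 + a\,z_2^2 - 2c\,z_1 z_2 + ab - c^2 = 0 .
\end{align*}
```

The quartic term $z_1^2 z_2^2$ cancels, which is why the constraint is a conic rather than a higher-order curve.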

We now state and prove a second lemma, that relates the depth differences of the world points with the interframe displacements.

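Lemma 2. Given two distinct orthographic projections of three points $O$, $A$, $B$, with depths $z_O$, $z_A$, $z_B$ (with respect to the coordinate system of the first frame), the tuple $(z_1, z_2)$, with $z_1 = z_O - z_A$ and $z_2 = z_O - z_B$, lies on a conic section in the plane. The coefficients of this conic depend entirely on the interframe displacements of the above points.

Proof. It is obvious that this statement is equivalent to the previous lemma. The reason that we state it is that we will use this form of the constraint in our subsequent analysis. Using the nomenclature of the previous lemma, we observe that $G \cdot \boldsymbol{\alpha}_1 = z_1$ and $G \cdot \boldsymbol{\beta}_1 = z_2$ (up to a common sign, which leaves the quadratic form unchanged) and so Eq. (12) becomes

$$\begin{aligned}
&(\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_2\|^2) z_1^2 + (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_2\|^2) z_2^2 - 2(\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_2 \cdot \boldsymbol{\beta}_2) z_1 z_2 \\
&\quad + (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_2\|^2)(\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_2\|^2) - (\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_2 \cdot \boldsymbol{\beta}_2)^2 = 0.
\end{aligned} \tag{13}$$

The above equation proves the claim.

The above lemmas relate the structure (shape) of three points to their two distinct orthographic projections. Whether the points move or the projection plane moves (moving observer) or both of them move, the analysis remains the same.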
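The conic constraint of Lemmas 1 and 2 can be checked numerically. The following is a minimal sketch, not the authors' code: it draws a random rigid configuration, rotates it, projects both positions orthographically, and evaluates the left-hand side of Eq. (13) at the true depth differences. The helper names (`rotation_matrix`, `conic_residual`, `random_trial`) are hypothetical; translation is omitted because it cancels in the difference vectors.

```python
import math
import random


def rotation_matrix(axis, angle):
    """Rodrigues formula: rotation by `angle` about `axis` (normalized here)."""
    norm = math.sqrt(sum(a * a for a in axis))
    x, y, z = (a / norm for a in axis)
    c, s, t = math.cos(angle), math.sin(angle), 1.0 - math.cos(angle)
    return [[t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
            [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
            [t * x * z - s * y, t * y * z + s * x, t * z * z + c]]


def rotate(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]


def dot2(a, b):
    return a[0] * b[0] + a[1] * b[1]


def conic_residual(a1, b1, a2, b2, z1, z2):
    """Left-hand side of Eq. (13); a1, b1 are view-1 vectors, a2, b2 view-2 vectors."""
    A = dot2(a1, a1) - dot2(a2, a2)
    B = dot2(b1, b1) - dot2(b2, b2)
    C = dot2(a1, b1) - dot2(a2, b2)
    return B * z1 * z1 + A * z2 * z2 - 2.0 * C * z1 * z2 + A * B - C * C


def random_trial(seed):
    """Residual of Eq. (13) for one random rigid motion of three points."""
    rng = random.Random(seed)
    OA = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    OB = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    axis = [rng.uniform(0.1, 1.0) for _ in range(3)]
    R = rotation_matrix(axis, rng.uniform(0.2, 1.2))
    OA2, OB2 = rotate(R, OA), rotate(R, OB)
    # Orthographic projection keeps (x, y); the third component is the depth difference.
    return conic_residual(OA[:2], OB[:2], OA2[:2], OB2[:2], OA[2], OB[2])


if __name__ == "__main__":
    print([random_trial(seed) for seed in range(5)])
```

The residuals should vanish up to floating-point rounding for any rigid motion, which is the content of the two lemmas.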

We now proceed with the third lemma, which analyzes the same problem but for the case of differential motion.

Lemma 3. The optic flow field at every point $(x, y)$ of an image under orthography constrains the gradient $(p, q)$ of the surface point whose image is the point $(x, y)$ to lie on one of two straight lines that pass through the origin of the gradient space.

Proof. This result was known to Ullman (1983) and Hoffman (1982), so it will not be proved here. A consequence of this result is that optic flow cannot recover surface orientation under orthography. The question that arises, then, is the following: is the discrete case different from the continuous one? In other words, two views close in time (optic flow) cannot recover surface orientation, but what happens when the views are not close in time (discrete displacements)? The following sections examine this question.

4 Lower Bound Results

So far we have established the facts that optic flow (two views close in time) cannot recover surface orientation and that two orthographic views of fewer than four points cannot recover the structure of those points. Here we study the same problem when the number of points in two views is four, and when the number of points in three views is three. The fact that we have two views and discrete motion does not necessarily imply that we will have infinitely many solutions as in the differential case, because now the constraints are different (hyperbolas in the discrete case, lines in the continuous case). So, the problem needs to be investigated. The following two theorems answer these questions.

Theorem 2. Two orthographic projections of four rigidly linked noncoplanar points are compatible with infinitely many interpretations of their relative 3-D positions. A third view yields a unique interpretation of the structure of the four points.

Proof. Let the four points in space be $O$, $A$, $B$, $C$. Let also the projections of the four points in the two frames be $O_1$, $A_1$, $B_1$, $C_1$ and $O_2$, $A_2$, $B_2$, $C_2$ respectively, and the gradients of the planes $OAB$, $OBC$, and $OCA$ be $G_1 = (p_1, q_1)$, $G_2 = (p_2, q_2)$, and $G_3 = (p_3, q_3)$ respectively (with respect to the first frame).

Using the projections $O_1A_1$, $O_1B_1$ and their corresponding ones $O_2A_2$, $O_2B_2$ and utilizing Lemma 1 we get

$$\alpha_1 p_1^2 + \beta_1 q_1^2 + \gamma_1 p_1 q_1 + \delta_1 = 0, \tag{14}$$

where the coefficients depend entirely on the image vectors. Similarly, considering the projections $O_1B_1$ and $O_1C_1$ and their corresponding ones in the second frame and the projections $O_1C_1$ and $O_1A_1$ and their corresponding ones in the second frame, we get

$$\alpha_2 p_2^2 + \beta_2 q_2^2 + \gamma_2 p_2 q_2 + \delta_2 = 0, \tag{15}$$

$$\alpha_3 p_3^2 + \beta_3 q_3^2 + \gamma_3 p_3 q_3 + \delta_3 = 0. \tag{16}$$

At this point we should mention that the above equations are independent because they come from the rigidity of the three rods OA, OB, OC. In other words the fact that the three lengths OA, OB, and OC in space remain constant and the two angles AOB and BOC in space remain constant between the two frames does not imply that the third angle COA will remain the same.

Proceeding, we note that we have more information about the gradients $G_1$, $G_2$, $G_3$ from the well known Mackworth constraints (Mackworth 1973):

$$G_1 \cdot O_1B_1 = G_2 \cdot O_1B_1, \tag{17}$$

$$G_2 \cdot O_1C_1 = G_3 \cdot O_1C_1, \tag{18}$$

$$G_3 \cdot O_1A_1 = G_1 \cdot O_1A_1. \tag{19}$$

Equations (14)-(19) constitute a system of six equations in the six unknowns $p_1$, $q_1$, $p_2$, $q_2$, $p_3$, $q_3$. These equations seem intuitively independent, but they might not be independent enough to guarantee that there exist only a finite number of solutions for the above system.

Before we proceed with a rigorous proof, we shed some light on the form and information content of (14)-(19). Equations (17)-(19) simply express the fact that the gradients $G_1$, $G_2$, $G_3$ of the three planes form a triangle whose side directions are known, but whose position and scale are not. On the other hand, (14)-(16) state that each of the gradients $G_1$, $G_2$, $G_3$ lies on a conic section in gradient space. So, in order to solve the problem (i.e. to find the three gradients) we have to place a triangle in gradient space such that its sides have the orientations defined by the Mackworth constraints (17)-(19) and each of its vertices lies on the corresponding one of the three conic sections defined by (14)-(16).¹

The simple fact that we have six equations in six unknowns does not mean that this system will have a finite number of solutions. To find out if there are a finite number of solutions we apply the inverse function theorem (Richards et al. 1983). This theorem allows us to conclude that whenever the Jacobian of these equations is nonsingular, the mapping defined by these equations is locally one to one and onto. Hence, any roots at points where the Jacobian is nonsingular are isolated and not part of a continuum of solutions. It is a simple exercise to compute the Jacobian of the above system and prove that it is identically zero. So, there are infinitely many solutions and they are given by the solutions of the system of (14)-(19). A trivial argument extends this proof to any number of points. To conclude the proof of the theorem, if we add one more view, then the solution is unique and the proof is immediate from the structure from motion theorem of Ullman (1979).

1 At this point we should say that several important problems in vision have been solved in a very similar way. Horn (1977) solved the problem of determining the shape of a polyhedral object from intensity information and the Mackworth constraints, and Kanade (1981) solved the same problem (shape of polyhedral objects) using skewed symmetry and the Mackworth constraints

Theorem 3. Three orthographic projections of three rigidly linked points are compatible with at most one interpretation (plus reflection) of their relative 3-D positions, in general. Furthermore, when a certain testable condition holds then there are at most two interpretations (plus reflections). Adding a fourth view yields a unique interpretation of the structure of the three points.

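Proof. Let the three points in space be $O$, $A$, $B$ with depths $z_O$, $z_A$, $z_B$ (with respect to the coordinate system of the first view), and their projections on the three frames be $O_i$, $A_i$, $B_i$ for $i = 1, 2, 3$ respectively.

Let also

$$z_1 = z_O - z_A, \qquad z_2 = z_O - z_B,$$

$$O_iA_i = \boldsymbol{\alpha}_i, \qquad O_iB_i = \boldsymbol{\beta}_i$$

for $i = 1, 2, 3$. Now applying Lemma 2 to frames 1 and 2 and then to frames 1 and 3 we get the following equations:

$$\begin{aligned}
&(\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_2\|^2) z_1^2 + (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_2\|^2) z_2^2 - 2(\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_2 \cdot \boldsymbol{\beta}_2) z_1 z_2 \\
&\quad + (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_2\|^2)(\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_2\|^2) - (\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_2 \cdot \boldsymbol{\beta}_2)^2 = 0,
\end{aligned} \tag{20}$$

$$\begin{aligned}
&(\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_3\|^2) z_1^2 + (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_3\|^2) z_2^2 - 2(\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_3 \cdot \boldsymbol{\beta}_3) z_1 z_2 \\
&\quad + (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_3\|^2)(\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_3\|^2) - (\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_3 \cdot \boldsymbol{\beta}_3)^2 = 0.
\end{aligned} \tag{21}$$

The above equations constitute a system $\Sigma_1$ of two equations in the two unknowns $z_1$, $z_2$. The Jacobian of this system has rank two in general, and so by applying the inverse function theorem we conclude that the system has finitely many solutions. Using Bezout's theorem we conclude that the system has at most four solutions (actually two solutions, plus the Necker reflections).

In the sequel we prove that, in general, the above system $\Sigma_1$ has a unique solution (plus reflection).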

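After eliminating the constant terms from (20) and (21) we get

$$(K_2N_1 - K_1N_2) z_1^2 + (M_2N_1 - M_1N_2) z_1 z_2 + (L_2N_1 - L_1N_2) z_2^2 = 0 \tag{22}$$

with

$$K_1 = \|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_2\|^2, \qquad K_2 = \|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_3\|^2,$$

$$L_1 = \|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_2\|^2, \qquad L_2 = \|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_3\|^2,$$

$$M_1 = -2(\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_2 \cdot \boldsymbol{\beta}_2), \qquad M_2 = -2(\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_3 \cdot \boldsymbol{\beta}_3),$$

$$N_1 = K_1L_1 - \frac{M_1^2}{4}, \qquad N_2 = K_2L_2 - \frac{M_2^2}{4}.$$

Equation (22) is homogeneous in $z_1$, $z_2$ and by dividing by $z_2^2$ and setting $z_1/z_2 = x$ we get the following equation:

$$(K_2N_1 - K_1N_2) x^2 + (M_2N_1 - M_1N_2) x + (L_2N_1 - L_1N_2) = 0. \tag{23}$$

The solution of the above equation is given by

$$x = \frac{-(M_2N_1 - M_1N_2) \pm \sqrt{\mathrm{Disc}}}{2(K_2N_1 - K_1N_2)}, \tag{24}$$

where Disc is the discriminant of (23). On the other hand, if the lengths of the vectors $OA$, $OB$ are $r$ and $\mu$ respectively, then from the geometry of the projection on the first frame, it is obvious that

$$z_1 = \pm\sqrt{r^2 - \|\boldsymbol{\alpha}_1\|^2}, \qquad z_2 = \pm\sqrt{\mu^2 - \|\boldsymbol{\beta}_1\|^2}.$$

Consequently,

$$x = \pm\sqrt{\frac{r^2 - \|\boldsymbol{\alpha}_1\|^2}{\mu^2 - \|\boldsymbol{\beta}_1\|^2}}.$$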

Thus, if $x$ has two solutions then these solutions must have the same absolute value and opposite sign if both are to be valid. We conclude that $x$ will have two valid solutions if

$$M_2N_1 = M_1N_2. \tag{25}$$

So far, we have concluded that if condition (25) holds, then the problem has two solutions (plus reflections), because there will be two solutions for $x = z_1/z_2$, and so four solutions for $(z_1, z_2)$ (actually two solutions, plus reflections). If condition (25) does not hold, then there is only one solution for $x$ and consequently two solutions for $(z_1, z_2)$ (actually one solution, plus reflection).²

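Finally, to conclude the proof we have to prove that if we add one more view, then we get a unique result. If we call $O_4$, $A_4$, $B_4$ the projections in the fourth view, and let $O_4A_4 = \boldsymbol{\alpha}_4$ and $O_4B_4 = \boldsymbol{\beta}_4$, then considering the first and the fourth frame we get the equation

$$\begin{aligned}
&(\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_4\|^2) z_1^2 + (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_4\|^2) z_2^2 - 2(\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_4 \cdot \boldsymbol{\beta}_4) z_1 z_2 \\
&\quad + (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_4\|^2)(\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_4\|^2) - (\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_4 \cdot \boldsymbol{\beta}_4)^2 = 0.
\end{aligned} \tag{26}$$

Equations (20), (21) and (26) constitute a system of three equations in two unknowns. So, this system, barring degeneracy, will have at most one solution.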

It is worth stating at this point that the above theorem provides an analysis of the conjecture developed by Sugihara and Sugie. There the problem of computing structure from two optic flow fields was examined, and even though a theoretical analysis indicated that there is more than one solution (plus reflection), in the experimental results it was observed that only one solution was possible. This led the authors in (Sugihara and Sugie 1984) to formulate a conjecture that two flow fields uniquely define structure. The above theorem explains this fact for the discrete case, since we have proved that in general only one solution is possible, except if (25) holds.

2 In addition, the above description can be used to actually find the structure of three points from three projections, by developing (23), solving for $x$ and then using this value in conjunction with (20) and (21) to solve for $z_1$, $z_2$, rejecting the imaginary roots.
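The recovery procedure outlined in footnote 2 can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' program: the function names are hypothetical, the test configuration is arbitrary synthetic data, and the sketch simply returns every real candidate $(z_1, z_2)$ with $z_2 > 0$ (one representative per reflection pair). It does not implement the validity argument around condition (25), so more than one candidate may be returned.

```python
import math


def rotation_matrix(axis, angle):
    """Rodrigues formula: rotation by `angle` about `axis` (normalized here)."""
    norm = math.sqrt(sum(a * a for a in axis))
    x, y, z = (a / norm for a in axis)
    c, s, t = math.cos(angle), math.sin(angle), 1.0 - math.cos(angle)
    return [[t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
            [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
            [t * x * z - s * y, t * y * z + s * x, t * z * z + c]]


def rotate(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]


def dot2(a, b):
    return a[0] * b[0] + a[1] * b[1]


def conic(a1, b1, ai, bi):
    """Coefficients (K, L, M, N) of K z1^2 + L z2^2 + M z1 z2 + N = 0."""
    K = dot2(b1, b1) - dot2(bi, bi)
    L = dot2(a1, a1) - dot2(ai, ai)
    M = -2.0 * (dot2(a1, b1) - dot2(ai, bi))
    return K, L, M, K * L - M * M / 4.0


def solve_three_views(a, b):
    """a[i], b[i] are the image vectors O_iA_i and O_iB_i of views 1, 2, 3."""
    K1, L1, M1, N1 = conic(a[0], b[0], a[1], b[1])   # Eq. (20)
    K2, L2, M2, N2 = conic(a[0], b[0], a[2], b[2])   # Eq. (21)
    qa = K2 * N1 - K1 * N2                           # Eq. (23): qa x^2 + qb x + qc = 0
    qb = M2 * N1 - M1 * N2
    qc = L2 * N1 - L1 * N2
    disc = qb * qb - 4.0 * qa * qc
    if qa == 0.0 or disc < 0.0:
        return []
    candidates = []
    for sign in (1.0, -1.0):
        x = (-qb + sign * math.sqrt(disc)) / (2.0 * qa)   # Eq. (24)
        denom = K1 * x * x + M1 * x + L1
        if denom == 0.0:
            continue
        z2_squared = -N1 / denom     # Eq. (20) with z1 = x * z2
        if z2_squared > 0.0:         # reject imaginary roots
            z2 = math.sqrt(z2_squared)
            candidates.append((x * z2, z2))
    return candidates


def demo():
    """Synthetic test: returns the true depth differences and the candidates."""
    OA, OB = [1.0, 0.3, 0.7], [-0.4, 1.1, -0.5]
    rotations = [rotation_matrix([1.0, 2.0, 3.0], 0.6),
                 rotation_matrix([-2.0, 1.0, 1.0], 0.9)]
    a, b = [OA[:2]], [OB[:2]]
    for R in rotations:
        a.append(rotate(R, OA)[:2])
        b.append(rotate(R, OB)[:2])
    return (OA[2], OB[2]), solve_three_views(a, b)


if __name__ == "__main__":
    truth, candidates = demo()
    print("true (z1, z2) up to reflection:", truth)
    print("candidates:", candidates)
```

The true depth differences should appear among the candidates up to the overall sign flip that corresponds to the reflection.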

Up to now we have established the facts that two views of any number of points admit infinitely many interpretations for the structure of those points, and that three views of at least three points admit finitely many solutions for the structure. But what if we have two views and we impose a smoothness condition on the surface on which the points lie? In that case, we prove (in the next section) that there exists a unique solution for the structure of the surface in view. This result fits into the regularization paradigm introduced into vision research by Poggio and his colleagues.

5 Employing a Smoothness Assumption

In this section we prove that two orthographic views of a "smooth" surface, with the correspondence between the two frames established, uniquely define the structure of the surface in view, when the boundary conditions are known (there is no occlusion). Of course the proof is based on the definition of "smoothness" that we will employ later.

Consider a moving surface $Z = Z(x, y)$ and let $(\Delta u(x, y), \Delta v(x, y))$ be the discrete displacement field for two time instants $t_1$ and $t_2$ with $t_1 < t_2$, i.e. if an image point is at the position $(x, y)$ at time $t_1$, then at time $t_2$ it will be at the position $(x + \Delta u(x, y), y + \Delta v(x, y))$. Then, assuming that the surface is locally (differentially) planar, we can prove that the gradient $(p, q)$ at a surface point whose projection on the image plane is the point $(x, y)$ satisfies the following conic constraint:

$$k_1 p^2 + k_2 q^2 - 2 k_3 pq + k_4 = 0 \tag{27}$$

with

$$k_1 = (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_2\|^2)\,dx^2,$$

$$k_2 = (\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_2\|^2)\,dy^2,$$

$$k_3 = (\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_2 \cdot \boldsymbol{\beta}_2)\,dx\,dy,$$

$$k_4 = (\|\boldsymbol{\alpha}_1\|^2 - \|\boldsymbol{\alpha}_2\|^2)(\|\boldsymbol{\beta}_1\|^2 - \|\boldsymbol{\beta}_2\|^2) - (\boldsymbol{\alpha}_1 \cdot \boldsymbol{\beta}_1 - \boldsymbol{\alpha}_2 \cdot \boldsymbol{\beta}_2)^2,$$

where

$$\boldsymbol{\alpha}_1 = (0, dy), \qquad \boldsymbol{\beta}_1 = (dx, 0),$$

$$\boldsymbol{\alpha}_2 = (\Delta u(x, y + dy) - \Delta u(x, y),\; dy + \Delta v(x, y + dy) - \Delta v(x, y))$$

and

$$\boldsymbol{\beta}_2 = (dx + \Delta u(x + dx, y) - \Delta u(x, y),\; \Delta v(x + dx, y) - \Delta v(x, y)).$$

The proof of (27) is immediate from the application of Lemma 1 to the points $(x, y)$, $(x + dx, y)$ and $(x, y + dy)$. It has to be emphasized that the coefficients $k_i$, $i = 1, \ldots, 4$ are not constant; they are functions of $(x, y)$ and so we write $k_i(x, y)$ instead of $k_i$. Let us write $M(p, q, x, y) = k_1 p^2 + k_2 q^2 - 2 k_3 pq$. Then, (27) becomes

$$M(p, q, x, y) + k_4(x, y) = 0. \tag{28}$$

What we need to prove is that (28) at every image point, coupled with smoothness, uniquely defines the structure $(p, q)$ of the object in view. For this, we need some more background that will be given in the next sections. Also, it has to be noted that the constraint (28) is an approximation since some assumptions were used for its development.
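The coefficients of the local conic constraint (27) can be computed directly from a sampled displacement field. The sketch below is a hypothetical illustration, not the authors' implementation: it builds the displacement field of a rotating and translating plane $Z = px + qy$ (for which the local-planarity assumption holds exactly), forms $k_1, \ldots, k_4$ by finite differences, and evaluates the left-hand side of (27) at the true gradient. All function names are assumptions made here.

```python
import math


def rotation_matrix(axis, angle):
    """Rodrigues formula: rotation by `angle` about `axis` (normalized here)."""
    norm = math.sqrt(sum(a * a for a in axis))
    x, y, z = (a / norm for a in axis)
    c, s, t = math.cos(angle), math.sin(angle), 1.0 - math.cos(angle)
    return [[t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
            [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
            [t * x * z - s * y, t * y * z + s * x, t * z * z + c]]


def plane_displacement(R, shift, p, q, x, y):
    """Image displacement (du, dv) of the point of the plane Z = p x + q y seen at (x, y)."""
    X = [x, y, p * x + q * y]
    x_new = sum(R[0][k] * X[k] for k in range(3)) + shift[0]
    y_new = sum(R[1][k] * X[k] for k in range(3)) + shift[1]
    return x_new - x, y_new - y


def conic_coefficients(disp, x, y, dx, dy):
    """k1..k4 of Eq. (27) from a displacement field disp(x, y) -> (du, dv)."""
    u0, v0 = disp(x, y)
    ux, vx = disp(x + dx, y)
    uy, vy = disp(x, y + dy)
    a1, b1 = (0.0, dy), (dx, 0.0)
    a2 = (uy - u0, dy + vy - v0)
    b2 = (dx + ux - u0, vx - v0)
    A = dot2(a1, a1) - dot2(a2, a2)
    B = dot2(b1, b1) - dot2(b2, b2)
    C = dot2(a1, b1) - dot2(a2, b2)
    return A * dx * dx, B * dy * dy, C * dx * dy, A * B - C * C


def dot2(a, b):
    return a[0] * b[0] + a[1] * b[1]


def residual(p, q, k):
    """Left-hand side of Eq. (27)."""
    k1, k2, k3, k4 = k
    return k1 * p * p + k2 * q * q - 2.0 * k3 * p * q + k4


def demo():
    p, q = 0.5, -0.3
    R = rotation_matrix([1.0, -1.0, 2.0], 0.4)
    disp = lambda x, y: plane_displacement(R, (0.2, -0.1), p, q, x, y)
    k = conic_coefficients(disp, 0.3, 0.7, 0.1, 0.1)
    return residual(p, q, k)


if __name__ == "__main__":
    print(demo())
```

For a curved surface the same computation holds only approximately, in line with the remark that (28) is an approximation.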

5.1 The Stereographic Projection

Surface orientation is quantified by the surface normal, a unit vector in $R^3$. Let $(l, m, n)$ be a vector denoting the direction of a surface normal. Then, this direction is defined by $(p, q) = (l/n, m/n)$.

This quantity is called the gradient of the normal and the space of all the possible gradients defines the so-called gradient space (Horn 1986). Gradient space is unbounded. But the gradient is not the only representation for surface orientation. Another representation, that results in a bounded space, is stereographic coordinates. A surface normal can be represented by a point on a unit sphere, called the Gaussian sphere. The part of the surface facing us corresponds to one hemisphere, let's define it to be the northern hemisphere, while points on the occluding boundaries correspond to the points on the equator. A point $n$ on the Gaussian sphere corresponds to a unique gradient $(p, q)$ and vice versa. This can be seen geometrically in the following way. Consider the projection of the Gaussian sphere by rays from its center onto a plane tangent to the north pole. This plane represents the gradient space $(p, q)$. This projection is called gnomonic projection and it has the property that great circles are mapped into lines. The north pole is mapped onto the origin of the gradient space, the northern hemisphere is mapped into the whole $p$-$q$ plane, and points on the equator end up at infinity. Points on the lower hemisphere are not projected onto the $p$-$q$ plane. With the gradient space formalism, the equator maps into infinity. As a result, occluding boundary information cannot be expressed in gradient space. One solution to this problem is the use of stereographic projection (Ikeuchi and Horn 1981). We can think of this projection in geometric terms also as a projection of the Gaussian sphere onto a plane tangent to the north pole. However, this time the center of projection is the south pole, not the center of the sphere. We label the axes of the stereographic plane as $f$, $g$. It can be shown that the relation between the stereographic space coordinates $f$, $g$ and the gradient space coordinates (Lee 1985) is given by:

$$f = \frac{2p\left[\sqrt{1 + p^2 + q^2} - 1\right]}{p^2 + q^2}, \tag{29}$$

$$g = \frac{2q\left[\sqrt{1 + p^2 + q^2} - 1\right]}{p^2 + q^2}, \tag{30}$$

$$p = \frac{4f}{4 - f^2 - g^2}, \tag{31}$$

$$q = \frac{4g}{4 - f^2 - g^2}. \tag{32}$$

This mapping is conformal and the whole northern hemisphere is mapped onto a closed disc of radius 2 on the f - g plane. The orientations of occluding boundaries correspond to points on the circumference of that disc. The advantage of the stereographic projection is that it maps the whole northern hemisphere (visible orientations) onto a bounded space, i.e. a disc of radius 2 in the stereographic plane with center at the origin (the point where the stereographic plane is tangent to the sphere, i.e. the north pole), and the occluding boundary information can be easily expressed (the points of the circumference of the disc).
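Equations (29)-(32) translate directly into a pair of conversion routines. The sketch below is illustrative; the function names are hypothetical, and the handling of the origin (where (29)-(30) are of the form 0/0) and of the disc boundary (where (31)-(32) diverge) are choices made here.

```python
import math


def gradient_to_stereographic(p, q):
    """Eqs. (29)-(30): map a gradient (p, q) to stereographic coordinates (f, g)."""
    s = p * p + q * q
    if s == 0.0:
        # The north pole maps to the origin; the formula is 0/0 there.
        return 0.0, 0.0
    scale = 2.0 * (math.sqrt(1.0 + s) - 1.0) / s
    return scale * p, scale * q


def stereographic_to_gradient(f, g):
    """Eqs. (31)-(32): inverse map, defined strictly inside the disc of radius 2."""
    denom = 4.0 - f * f - g * g
    if denom <= 0.0:
        raise ValueError("(f, g) lies on or outside the disc of radius 2")
    return 4.0 * f / denom, 4.0 * g / denom


if __name__ == "__main__":
    f, g = gradient_to_stereographic(1.5, -0.5)
    print((f, g), stereographic_to_gradient(f, g))
```

Every finite gradient lands strictly inside the disc $f^2 + g^2 < 4$, and steep gradients approach its circumference, which is where occluding-boundary orientations live.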

5.2 Combining Motion with Smoothness

In this section we prove that two orthographic projections of a smooth surface uniquely define the structure of the surface according to the smoothness criterion.

If we have two orthographic projections of a surface, with the correspondence between the two projections established, then (28) holds, i.e.

M(p, q, x, y) + k4(x, y) = 0. (33)

If we use (31) and (32) we can substitute for p, q in terms of f and g, and assuming that the surface does not contain points whose orientation is equal to orientation at the boundaries, we get a relation of the form

$$L(f, g, x, y) = 0, \tag{34}$$

with L a polynomial in f, g whose coefficients depend on the image data. From (34), there are infinitely many solutions for the structure of the surface in view.

Clearly, the problem of finding structure $f$, $g$ from (34) is an ill-posed problem in the sense of Hadamard (1923), since the solution is not unique. To regularize the problem and make it well-posed, we should introduce some more constraints on the problem. Specific regularization methods for solving ill-posed problems have been developed (Tichonov and Arsenin 1977). Following the paradigm of regularization theory, we require that the surface in view be smooth and we wish to minimize the quantity (Poggio et al. 1985)

$$e = \iint_D \left\{ \left[ (f_x)^2 + (f_y)^2 + (g_x)^2 + (g_y)^2 \right] + \lambda (L)^2 \right\} dx\,dy, \tag{34.1}$$

where $\lambda$ is a regularization constant.

In other words, to solve this ill-posed problem (i.e. to make it well-posed), we should restrict the class of admissible solutions by introducing suitable a priori knowledge. This a priori knowledge can be exploited, for example, under the form of variational principles that impose constraints on the possible solutions. By choosing to minimize $e$ in (34.1) we wish to find a solution that best satisfies the constraint $L = 0$ and is smooth, where the relative degree of smoothness is measured by the first term in (34.1). But finding the solution is difficult, because of the nonlinearity involved. Even in this nonlinear case, methods for obtaining the solution have been developed (Morozov 1984), but the solution space is no longer convex, and so several local minima may be found in the process of minimization. Because of this, we employ a different method that can be shown to converge to a unique solution.

So, we require that the solution be a surface that best satisfies the motion constraint (34) and at the same time be as smooth as possible, according to the stabilizing functional (Tichonov et al. 1977) introduced in (34.1).

The question that then arises is to discover whether or not there exists a unique surface that minimizes e, and if it exists, to give an algorithm that finds this unique solution. In the sequel we investigate this question. Our analysis is done for the discrete case, i.e. by taking into account the discrete nature of images. We use techniques from the field of partial differential equations (Smith 1978) that have been used in vision research (Lee 1985).

5.3 Finding the Unique Solution

Consider the motion constraint Eq. (34), $L(f, g, x, y) = 0$, $(x, y) \in D$, where $D$ is the unit square region in the $x$-$y$ plane with mesh size $m$, and discretize $e$ by using difference operators instead of differential operators and summations instead of integrals. We consider the boundary of the imaged object as a square for simplicity. A nonsquare boundary can be used (employing some finite element methods), but the solution is very involved and it will not be presented here.

Let $n = k^2$, where $k + 1 = 1/m$. The desired surface is the one that minimizes

$$e = \sum_{i,j} \left( s_{i,j} + \lambda\, l_{i,j} \right), \tag{35}$$

where

$$s_{i,j} = \frac{1}{m^2} \left\{ [f_{i+1,j} - f_{i,j}]^2 + [f_{i,j+1} - f_{i,j}]^2 + [g_{i+1,j} - g_{i,j}]^2 + [g_{i,j+1} - g_{i,j}]^2 \right\}$$

and

$$l_{i,j} = [L(f_{i,j}, g_{i,j}, i, j)]^2,$$

and where $f_{i,j}$, $g_{i,j}$ represent the surface orientation at the regular grid point $(im, jm)$. This minimization is subject to boundary conditions, i.e. $f_{i,j}$ and $g_{i,j}$ are known if $(im, jm)$ belongs to the boundary. We assume that the surface normal at a boundary point $(i, j)$ is parallel to the image plane, i.e. $f_{i,j}^2 + g_{i,j}^2 = 4$ (occluding boundary).

Function $e$ of Eq. (34.1) is defined on a compact subset $K$ of $R^{2n}$, with $n = k^2$, and it is continuous with respect to $f_{i,j}$ and $g_{i,j}$. Therefore, there exists a solution to the minimization problem. Furthermore, the solution that minimizes $e$ is the solution of the system

$$\frac{\partial e}{\partial f_{i,j}} = \frac{\partial e}{\partial g_{i,j}} = 0. \tag{36}$$

Equations (36) become

$$\begin{aligned}
f_{i,j} &= f^*_{i,j} - \tfrac{1}{4} \lambda m^2\, [L(f_{i,j}, g_{i,j}, i, j)]\, \frac{\partial L}{\partial f}(f_{i,j}, g_{i,j}, i, j), \\
g_{i,j} &= g^*_{i,j} - \tfrac{1}{4} \lambda m^2\, [L(f_{i,j}, g_{i,j}, i, j)]\, \frac{\partial L}{\partial g}(f_{i,j}, g_{i,j}, i, j),
\end{aligned} \tag{37}$$

where

$$f^*_{i,j}\ (g^*_{i,j}) = \frac{f_{i+1,j}(g_{i+1,j}) + f_{i,j+1}(g_{i,j+1}) + f_{i-1,j}(g_{i-1,j}) + f_{i,j-1}(g_{i,j-1})}{4}.$$

Equations (37) can be written as

$$\Psi \xi = -\lambda m^2\, \Phi(\xi), \tag{38}$$

where

$$\xi = [f_{1,1}, \ldots, f_{1,k}, \ldots, f_{k,k}, g_{1,1}, \ldots, g_{k,k}]^T,$$

$$\Phi(\xi) = \left[ \ldots, \{L(f_{i,j}, g_{i,j}, i, j)\} \frac{\partial L(f_{i,j}, g_{i,j}, i, j)}{\partial f}, \ldots, \{L(f_{i,j}, g_{i,j}, i, j)\} \frac{\partial L(f_{i,j}, g_{i,j}, i, j)}{\partial g}, \ldots \right]^T,$$

and

$$\Psi = \begin{pmatrix} A & 0 \\ 0 & A \end{pmatrix}, \quad \text{where } A \in R^{n \times n},$$

$$A = \begin{pmatrix} B & -I & & \\ -I & B & -I & \\ & \ddots & \ddots & \ddots \\ & & -I & B \end{pmatrix} \tag{39}$$

and

$$B = \begin{pmatrix} 4 & -1 & & \\ -1 & 4 & -1 & \\ & \ddots & \ddots & \ddots \\ & & -1 & 4 \end{pmatrix} \in R^{k \times k}.$$

Equation (38) is nothing but (37) written in a compact form. Notice that (38) is the necessary condition for the solution that minimizes (35). We will now prove that (38) has a unique solution. For this we will need the fact that the functions $[L(f, g, i, j)]\, \partial L(f, g, i, j)/\partial f$ and $[L(f, g, i, j)]\, \partial L(f, g, i, j)/\partial g$ are Lipschitz with respect to $f$ and $g$. This claim will be proved later. Let $\xi_1$ and $\xi_2$ be two solutions of (38), with $\xi_1 \neq \xi_2$.

Then, since $\Psi$ is invertible (Lee 1985), we have

$$\xi_1 - \xi_2 = -\lambda m^2\, \Psi^{-1} \left( \Phi(\xi_1) - \Phi(\xi_2) \right),$$

or

$$\|\xi_1 - \xi_2\|_2 \le \lambda m^2\, \|\Psi^{-1}\|_2\, \|\Phi(\xi_1) - \Phi(\xi_2)\|_2.$$

But

$$\|\Psi^{-1}\|_2 = \left[ 8 \sin^2\left( \frac{\pi m}{2} \right) \right]^{-1} < \left[ 2 \pi^2 m^2 \left( 1 - \frac{\pi^2 m^2}{24} \right)^2 \right]^{-1} \tag{40}$$

from (Smith 1978). Also, since $\{L(f, g, i, j)\}\, \partial L/\partial f$ and $\{L(f, g, i, j)\}\, \partial L/\partial g$ are Lipschitz with respect to $f$ and $g$, we have

$$\left| \{L(f, g, i, j)\} \frac{\partial L(f, g, i, j)}{\partial f} - \{L(f', g', i, j)\} \frac{\partial L(f', g', i, j)}{\partial f} \right| \le c_{i,j} \left\{ (f - f')^2 + (g - g')^2 \right\}^{1/2}, \tag{41}$$

$$\left| \{L(f, g, i, j)\} \frac{\partial L(f, g, i, j)}{\partial g} - \{L(f', g', i, j)\} \frac{\partial L(f', g', i, j)}{\partial g} \right| \le d_{i,j} \left\{ (f - f')^2 + (g - g')^2 \right\}^{1/2}, \tag{42}$$

and

$$\mu = \max_{i,j} \{ c_{i,j}, d_{i,j} \}. \tag{43}$$


From (41), (42), and (43) we get

$$\|\Phi(\xi_1) - \Phi(\xi_2)\|_2 \le \mu\, \|\xi_1 - \xi_2\|_2. \tag{44}$$

So we get

$$\|\xi_1 - \xi_2\|_2 \le \frac{\lambda}{2 \pi^2 \left( 1 - \pi^2 m^2 / 24 \right)^2}\, \mu\, \|\xi_1 - \xi_2\|_2. \tag{45}$$

If we choose $\lambda$ such that

$$\frac{\lambda}{2 \pi^2 \left( 1 - \pi^2 m^2 / 24 \right)^2}\, \mu < 1,$$

then (45) leads to a contradiction. So $\xi_1 = \xi_2$ and (38) has a unique solution. Furthermore, the sequence $\xi^{(a)}$ defined by

$$\xi^{(a+1)} = -\lambda m^2\, \Psi^{-1} \Phi(\xi^{(a)}), \quad a = 0, 1, 2, \ldots$$

converges to the unique solution of (38). So we have proved the following theorem.

Theorem 4. If $0 < \lambda < 2 \pi^2 \mu^{-1} [1 - \pi^2 m^2 / 24]^2$, then there exists a unique surface minimizing $e$, which is also the unique solution of (38), and the above described algorithm converges to that solution.
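The constants entering Theorem 4 can be checked numerically. The following is a minimal Python sketch, under stated assumptions: the function names, the grid size k = 7 and the Lipschitz constant mu = 3.0 are hypothetical choices made only for illustration. It verifies that the lowest grid mode is an eigenvector of the five-point matrix of (39) with eigenvalue 8 sin^2(pi m / 2), that the inequality in (40) holds for this mesh size, and that a value of lambda just inside the interval of Theorem 4 gives a contraction factor below one.

```python
import math


def smallest_eigenvalue(m: float) -> float:
    """Smallest eigenvalue 8*sin^2(pi*m/2) of the five-point matrix; its inverse is the norm in (40)."""
    return 8.0 * math.sin(math.pi * m / 2.0) ** 2


def apply_five_point(v, k):
    """Apply the block-tridiagonal matrix of (39) to a k x k grid (zero outside the grid)."""
    def at(i, j):
        return v[i][j] if 0 <= i < k and 0 <= j < k else 0.0

    return [[4.0 * v[i][j] - at(i + 1, j) - at(i - 1, j) - at(i, j + 1) - at(i, j - 1)
             for j in range(k)] for i in range(k)]


def lambda_upper_bound(mu: float, m: float) -> float:
    """Right end of the interval in Theorem 4: 2*pi^2/mu * (1 - pi^2*m^2/24)^2."""
    return 2.0 * math.pi ** 2 / mu * (1.0 - (math.pi * m) ** 2 / 24.0) ** 2


def contraction_factor(lam: float, mu: float, m: float) -> float:
    """lam * m^2 * (norm of the inverse matrix) * mu, the factor appearing in the uniqueness proof."""
    return lam * m * m * mu / smallest_eigenvalue(m)


k = 7                      # hypothetical number of interior grid points per side
m = 1.0 / (k + 1)          # mesh size, from k + 1 = 1/m
lowest_mode = [[math.sin((i + 1) * math.pi * m) * math.sin((j + 1) * math.pi * m)
                for j in range(k)] for i in range(k)]
image = apply_five_point(lowest_mode, k)
residual = max(abs(image[i][j] - smallest_eigenvalue(m) * lowest_mode[i][j])
               for i in range(k) for j in range(k))

mu = 3.0                   # hypothetical Lipschitz constant, for illustration only
lam = 0.99 * lambda_upper_bound(mu, m)
print("eigenvector residual:", residual)
print("contraction factor:", contraction_factor(lam, mu, m))
```

The sketch only exercises the arithmetic of the bound; the value of mu for a real motion constraint has to come from the Lipschitz estimates (41) and (42).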

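To complete the proof we need to show that (41) and (42) hold. Observe that $L(f, g, i, j)$ is a polynomial in $f$, $g$. Consequently, $L\, \partial L/\partial f$ and $L\, \partial L/\partial g$ are also polynomials for every $i, j$. So, we can take the Lipschitz constant $c_{i,j}$ for (41) to be such that

$$c_{i,j} \le \left\{ \left[ \sup_{f^2 + g^2 \le 4} \left| \frac{\partial}{\partial f} \left( L_{i,j} \frac{\partial L_{i,j}}{\partial f} \right) \right| \right]^2 + \left[ \sup_{f^2 + g^2 \le 4} \left| \frac{\partial}{\partial g} \left( L_{i,j} \frac{\partial L_{i,j}}{\partial f} \right) \right| \right]^2 \right\}^{1/2},$$

$$d_{i,j} \le \left\{ \left[ \sup_{f^2 + g^2 \le 4} \left| \frac{\partial}{\partial f} \left( L_{i,j} \frac{\partial L_{i,j}}{\partial g} \right) \right| \right]^2 + \left[ \sup_{f^2 + g^2 \le 4} \left| \frac{\partial}{\partial g} \left( L_{i,j} \frac{\partial L_{i,j}}{\partial g} \right) \right| \right]^2 \right\}^{1/2}.$$

With the above equations we mean that $c_{i,j}$ and $d_{i,j}$ exist and are finite for every $i, j$. Finiteness is also clear since the suprema are taken over the disc of radius 2 (a compact set) and the functions are polynomials. In this context, if we replace $c_{i,j}$ and $d_{i,j}$ by their respective upper bounds of the above equations, it is possible that the resulting value of $\mu$ is larger than the actual one. However, this is immaterial for the uniqueness proof, since it only results in a smaller range for $\lambda$.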
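As an illustration of how such bounds might be evaluated in practice, the Python sketch below samples the disc of radius 2 and estimates the two suprema by central differences. Everything in it is an assumption made for the example: the constraint `toy_L` is a hypothetical linear polynomial standing in for the motion constraint (34), and the helper names and sampling densities are arbitrary. Note that sampling approximates the supremum from below, so the result is an estimate and not a rigorous bound.

```python
import math
from typing import Callable, Tuple

Func = Callable[[float, float], float]


def sup_partials(p: Func, radius: float = 2.0, n_r: int = 40, n_t: int = 90,
                 h: float = 1e-5) -> Tuple[float, float]:
    """Sampled suprema of |dp/df| and |dp/dg| over the disc f^2 + g^2 <= radius^2."""
    sup_f = sup_g = 0.0
    for a in range(n_r + 1):
        r = radius * a / n_r
        for b in range(n_t):
            t = 2.0 * math.pi * b / n_t
            f, g = r * math.cos(t), r * math.sin(t)
            # Central differences; exact up to rounding for low-degree polynomials.
            dp_df = (p(f + h, g) - p(f - h, g)) / (2.0 * h)
            dp_dg = (p(f, g + h) - p(f, g - h)) / (2.0 * h)
            sup_f = max(sup_f, abs(dp_df))
            sup_g = max(sup_g, abs(dp_dg))
    return sup_f, sup_g


def lipschitz_estimate(p: Func, radius: float = 2.0) -> float:
    """Square root of the sum of the squared suprema, as in the bounds on c and d."""
    sup_f, sup_g = sup_partials(p, radius)
    return math.hypot(sup_f, sup_g)


# Hypothetical stand-in for the constraint at one grid point: L = f + 2g - 0.5.
def toy_L(f: float, g: float) -> float:
    return f + 2.0 * g - 0.5


def toy_L_dL_df(f: float, g: float) -> float:
    return toy_L(f, g) * 1.0   # dL/df = 1 for the toy constraint


def toy_L_dL_dg(f: float, g: float) -> float:
    return toy_L(f, g) * 2.0   # dL/dg = 2 for the toy constraint


c = lipschitz_estimate(toy_L_dL_df)
d = lipschitz_estimate(toy_L_dL_dg)
mu = max(c, d)
print("c =", c, "d =", d, "mu =", mu)
```

For the toy constraint the partial derivatives are constant, so the estimates can be compared with the closed-form values sqrt(5) and sqrt(20).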

6 Learning the Regularized Algorithm

Having formulated the problem of structure from motion given two orthographic projections in the regularization paradigm, given that we deal with smooth surfaces, we have obtained the result that a unique solution exists and that it is the limit of a sequence. But the result depends on the choice of the regularization parameter $\lambda$, and we proved that there exist infinitely many values of $\lambda$ in an interval for which the problem has a unique solution. The issue that arises, then, is for which value of $\lambda$ we obtain the correct solution, i.e. the solution that is physically realizable. To find such a $\lambda$ would mean to learn this particular algorithm.

The learning can be done from examples. In this particular situation, the system is given the actual shape in view and the displacement field, and it is asked for the value of $\lambda$. This can be done with the method of cross-validation (Wahba 1984). But this method depends on a specific quadratic variational condition (41.1). An alternative learning method, first introduced by Poggio et al. (1985), is to employ learning without taking into account specific variational conditions (smoothing conditions). This method has been developed in (Aloimonos and Shulman 1988) and here we will only describe it in general terms.

Consider (38) and put the expression $\lambda m^2$ inside the matrix $\Psi$, i.e. $\Psi_1 \xi = \Phi(\xi)$, where $\Psi_1$ is the modified matrix. We now consider, as the learning of our problem, finding the matrix $\Psi_1$. If we are given several examples, i.e. pairs $(\xi, \Phi(\xi))$ of input-output, and we arrange these vectors $\xi$ and $\Phi(\xi)$ in two matrices $D_1$ and $D_2$, then the problem of synthesizing the regularizing operator $\Psi_1$ that provides the regularized solution $\xi$ for input $\Phi(\xi)$ is equivalent, following Poggio et al. (1985), to solving the equation $\Psi_1 D_1 = D_2$ and finding the matrix $\Psi_1$. A solution of this equation is

$$\Psi_1 = D_2 D_1^{+},$$

where $D_1^{+}$ is the pseudoinverse of $D_1$ (Ballard and Brown 1984). As noted in (Poggio et al. 1985) the pseudoinverse can be computed in an adaptive way, by updating it and making it more accurate as more data comes in. In this way we can regularize the structure from motion problem by synthesizing the regularizing operator without need of an explicit variational principle. Experiments with this method (Aloimonos and Shulman 1986) have demonstrated the feasibility of the approach.
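The pseudoinverse solution can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions and not the procedure of the cited papers: it assumes that the example vectors stored as columns of D1 span the input space, so that the pseudoinverse equals D1^T (D1 D1^T)^{-1}, and the helper names and the 2 x 2 toy operator are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(g: Matrix, rhs: Matrix) -> Matrix:
    """Solve g x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(g)
    aug = [list(g[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("examples do not span the input space")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def learn_operator(d1: Matrix, d2: Matrix) -> Matrix:
    """Return D2 D1^+ using D1^+ = D1^T (D1 D1^T)^{-1} (full row rank assumed)."""
    d1t = transpose(d1)
    gram = matmul(d1, d1t)     # D1 D1^T, symmetric
    cross = matmul(d2, d1t)    # D2 D1^T
    # The operator satisfies operator * gram = cross, i.e. gram * operator^T = cross^T.
    return transpose(solve(gram, transpose(cross)))


# Hypothetical toy data: three example inputs (columns of d1) and their outputs.
true_operator = [[2.0, -1.0], [0.5, 3.0]]
d1 = [[1.0, 0.0, 1.0],
      [0.0, 1.0, 1.0]]
d2 = matmul(true_operator, d1)
learned = learn_operator(d1, d2)
print(learned)
```

Because the toy outputs are generated by a linear operator, the learned matrix reproduces it up to rounding; with noisy or inconsistent examples the same formula returns the least-squares fit.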

7 Experiments

Here we present experiments for the recovery of structure from motion given two orthographic projections with the correspondence between them established, and the assumption that the surface in view is smooth, without using the learning algorithm. Figure 1 shows the displacement field for a rotating sphere ($\omega_1 = 1$, $\omega_2 = 2$, $\omega_3 = 3$). Figure 2 shows the shape of the object in view that was reconstructed using the iterative algorithm in Sect. 5.3. The error was negligible. The value of $\lambda$ that worked best was 0.65 (and it was in the interval required by Theorem 4).

Fig. 1. Displacement field

Fig. 2. Reconstructed shape

Fig. 3. Schematic description of the error (x: calculated orientation; y: actual orientation; Error = S/(area of the sphere))

Fig. 4. Displacement field

Fig. 5. Reconstructed shape

Actually, the iterative formula of Sect. 5.3 was not used in this experiment. We used the simple Eq. (37) (as in (Ikeuchi and Horn 1981)), which converged to a solution, even though we do not have any theoretical results about the convergence of this particular iteration process. In particular, the equations we used were

$$f_{i,j}^{(a+1)} = f_{i,j}^{*(a)} - \tfrac{1}{4} \lambda m^2\, L\, \frac{\partial L}{\partial f}, \qquad g_{i,j}^{(a+1)} = g_{i,j}^{*(a)} - \tfrac{1}{4} \lambda m^2\, L\, \frac{\partial L}{\partial g},$$

where $L$ and its derivatives are evaluated at the current iterate $(f_{i,j}^{(a)}, g_{i,j}^{(a)}, i, j)$.

The error was calculated to be 0.012, where error means the average percent error at each pixel. The error at each pixel was taken to be $\theta / (\text{area of the sphere})$, where $\theta$ is the solid angle subtended by rotating the calculated orientation about the actual orientation. Figure 3 gives a pictorial description of the error at each pixel.
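The iteration based on Eq. (37) can be sketched in Python as follows. This is a hedged illustration and not the code used for the reported experiments: the linear constraint `toy_L`, the boundary values produced by `make_grids` and the stopping tolerance are hypothetical stand-ins, since the actual motion constraint (34) and the occluding-boundary data are not reproduced here.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]
Constraint = Callable[[float, float, int, int], float]


def jacobi_step(f: Grid, g: Grid, lam: float, m: float, L: Constraint,
                dL_df: Constraint, dL_dg: Constraint) -> Tuple[Grid, Grid]:
    """One sweep: neighbour average minus (lam*m^2/4) * L * dL/df (or dL/dg)."""
    size = len(f)
    f_new = [row[:] for row in f]   # boundary rows and columns stay fixed
    g_new = [row[:] for row in g]
    step = 0.25 * lam * m * m
    for i in range(1, size - 1):
        for j in range(1, size - 1):
            f_star = 0.25 * (f[i + 1][j] + f[i][j + 1] + f[i - 1][j] + f[i][j - 1])
            g_star = 0.25 * (g[i + 1][j] + g[i][j + 1] + g[i - 1][j] + g[i][j - 1])
            value = L(f[i][j], g[i][j], i, j)
            f_new[i][j] = f_star - step * value * dL_df(f[i][j], g[i][j], i, j)
            g_new[i][j] = g_star - step * value * dL_dg(f[i][j], g[i][j], i, j)
    return f_new, g_new


def iterate(f: Grid, g: Grid, lam: float, m: float, L: Constraint, dL_df: Constraint,
            dL_dg: Constraint, tol: float = 1e-10, max_iter: int = 10000):
    """Repeat sweeps until the largest change falls below tol (hypothetical stopping rule)."""
    for count in range(1, max_iter + 1):
        f_next, g_next = jacobi_step(f, g, lam, m, L, dL_df, dL_dg)
        change = max(abs(a - b)
                     for old, new in ((f, f_next), (g, g_next))
                     for row_old, row_new in zip(old, new)
                     for a, b in zip(row_old, row_new))
        f, g = f_next, g_next
        if change < tol:
            return f, g, count
    return f, g, max_iter


def make_grids(k: int, m: float) -> Tuple[Grid, Grid]:
    """Hypothetical boundary data: f = i*m and g = 0 on the border, zeros inside."""
    size = k + 2
    f = [[i * m if i in (0, size - 1) or j in (0, size - 1) else 0.0
          for j in range(size)] for i in range(size)]
    g = [[0.0] * size for _ in range(size)]
    return f, g


# Hypothetical linear constraint standing in for L of Eq. (34).
def toy_L(f: float, g: float, i: int, j: int) -> float:
    return f - g - 0.1


def toy_dL_df(f: float, g: float, i: int, j: int) -> float:
    return 1.0


def toy_dL_dg(f: float, g: float, i: int, j: int) -> float:
    return -1.0


k = 7
m = 1.0 / (k + 1)
f0, g0 = make_grids(k, m)
f_sol, g_sol, sweeps = iterate(f0, g0, 0.65, m, toy_L, toy_dL_df, toy_dL_dg)
print("sweeps until convergence:", sweeps)
```

With lam set to zero the sweep reduces to plain neighbour averaging, which is a convenient sanity check; for the toy constraint the extra term is small enough that the iteration still contracts, but nothing in this sketch guarantees convergence for a general polynomial constraint.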

Figures 4 and 5 show analogous results for a rotating cylinder. The error in the reconstruction was again negligible.³

³ We assumed knowledge of surface normals everywhere on the boundary.


8 Conclusions and Future Directions

We have analyzed the problem of structure from motion in the case of orthographic projection. Optic flow and discrete displacements contain the same amount of information, even though the mathematical constraints that describe the problem are different, and two views cannot recover structure, no matter how many point correspondences are used. If smoothness is assumed, then uniqueness of the structure from two frames may be established. We are currently working towards extending this result for any form of boundary. Also, three views of three points admit finitely many interpretations in general. Finally, a learning algorithm was discussed for learning the regularization parameters, in the case of two orthographic views.

References

Adiv G (1984) Determining three dimensional motion and structure from optical flow generated by several moving observers. COINS-TR 84-07, University of Massachusetts at Amherst

Aloimonos J (1986) Computing intrinsic images. Ph.D. thesis, Department of Computer Science, University of Rochester, NY

Aloimonos J, Shulman D (1988) Learning early vision computations. J Opt Soc Am: (to be published)

Ballard D, Brown CM (1984) Computer vision. Prentice Hall, Englewood Cliffs, New Jersey

Bandyopadhyay A (1986) Motion perception: a computational study. Ph.D. thesis. Department of Computer Science, University of Rochester, NY

Bennett BM, Hoffman DD (1985) The computation of structure from fixed-axis motion: nonrigid structures. Biol Cybern 51:293-300

Hadamard J (1923) Lectures on the Cauchy problem in linear partial differential equations. Yale University Press, New Haven

Hoffman DD (1982) Inferring local surface orientation from motion fields. J Opt Soc Am 72:880-892

Hoffman DD, Bennett BM (1985) Inferring the relative three-dimensional positions of two moving points. J Opt Soc Am A 2:350-353

Hoffman DD, Flinchbaugh BE (1982) The interpretation of biological motion. Biol Cybern 42:195-204

Horn BKP (1977) Understanding image intensities. Artif Intell 8:201-231

Horn BKP (1986) Robot vision. McGraw-Hill, New York

Ikeuchi K, Horn BKP (1981) Numerical shape from shading and occluding boundaries. Artif Intell 17:141-184

Johansson G (1973) Visual perception of biological motion and a model for its analysis. Percept Psychophys 14:201-211

Kanade T (1981) Determining the shape of an object from a single view. Artif Intell 17:409-460

Koenderink JJ, Van Doorn AJ (1977) Invariant properties of the motion parallax field due to the movement of rigid bodies relative to an observer. Opt Acta 22:773-791

Lee D (1985) A provably convergent algorithm for shape from shading. Proceedings of the DARPA Image Understanding Workshop, Miami, Fla, pp 489-496

Longuet-Higgins HC (1981) A computer algorithm for reconstructing a scene from two projections. Nature 293:133-135

Longuet-Higgins HC, Prazdny K (1984) The interpretation of a moving retinal image. Proc R Soc London B 208:385-397

Mackworth AK (1973) Interpreting pictures of polyhedral scenes. Artif Intell 4:121-137

Morozov VA (1984) Methods for solving incorrectly posed problems. Springer, Berlin Heidelberg New York

Poggio T, Koch C (1985) Ill-posed problems in early vision: from computational theory to analog networks. Proc R Soc London B 226:303-323

Poggio T, et al. (1985) MIT progress in understanding images. Proceedings of the Image Understanding Workshop, Miami, Fla

Richards W, Rubin J, Hoffman DD (1983) Equation counting and the interpretation of sensory data. Perception 11:557-576

Smith GD (1978) Numerical solution of partial differential equations: finite difference methods. Oxford University Press, Oxford

Sugie N, Inagaki H (1984) Computational aspect of kinetic depth effect. Biol Cybern 50:431-436

Sugihara S, Sugie N (1984) Recovery of rigid structure from orthographically projected optic flow. CVGIP 27:309-320

Tichonov AN, Arsenin VY (1977) Solution of ill-posed problems. Winston and Wiley, Washington, DC

Tsai RY, Huang TS (1984) Uniqueness and estimation of three dimensional motion parameters of rigid objects with curved surfaces. IEEE Trans PAMI 6:13-27

Ullman S (1979) The interpretation of structure from motion. Proc R Soc London B 203:405-426

Ullman S (1983) Computational studies in the interpretation of structure and motion: summary and extension. AI Memo 706, MIT AI Laboratory

Wahba G (1984) Cross-validated spline methods for the estimation of functions from data on functionals. In: David HA, David HT (eds) Statistics: an appraisal. Iowa State University

Wallach H, O'Connell DN (1953) Kinetic depth effect. J Exp Psychol 45:205-217

Waxman A, Ullman S (1985) Surface structure and three-dimensional motion parameters from image flow kinematics. Int J Robot Res 4:79-94

Received: May 13, 1988 Accepted September 21, 1988

John Aloimonos Computer Vision Laboratory Center for Automation Research University of Maryland College Park, MD 20742 USA
