On the distortion of shape recovery from motion
Tao Xiang^a, Loong-Fah Cheong^b,*
^a Department of Computer Science, Queen Mary, University of London, London E1 4NS, UK
^b Department of Electrical and Computer Engineering, National University of Singapore, Singapore 119260
Received 12 March 2003; received in revised form 26 November 2003; accepted 12 February 2004
Abstract
Given that most current structure from motion (SFM) algorithms cannot recover true motion estimates reliably, it is important to
understand the impact such motion errors have on the shape reconstruction. In this paper, various robustness issues surrounding different
types of second order shape estimates recovered from the motion cue are addressed. We present a theoretical model to understand the impact that
errors in the motion estimates have on shape recovery. Using this model, we focus on the recovery of second order shape under different
generic motions, each presenting a different degree of error sensitivity. We also show that different shapes exhibit different
degrees of robustness with respect to their recovery. Understanding such differences in distortion behaviour is important if we want to design
better fusion strategies with other shape cues.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Shape recovery; Structure from motion; Error analysis; Iso-distortion framework
1. Introduction
In spite of the best efforts of a generation of computer
vision researchers, we still do not have a practical and robust
system for accurately reconstructing shape from a sequence
of moving imagery under all motion-scene configurations.
There have been a few theoretical analyses of current SFM
algorithms in an attempt to understand these algorithms'
error behaviour [18,23,24]. Questions have also been asked
about the goals of shape reconstruction as well as the nature
of shape representation appropriate for achieving these
goals. Much of the intervening research emphasized either
the importance of a closer coupling with the tasks to be
performed by the systems [2] or, often correlatively, a
more qualitative representation of depth [9,13]. The view is
that the physical world imposes such complexity that
adequate interpretation is impossible outside the context of
tasks. Accordingly, less metric forms of depth representation
such as projective depth and ordinal depth¹ might be
adequate for accomplishing these tasks.
Shape reconstruction typically occurs in a context where
different cues such as shading, contour, range data and
motion are available [7]. It has been stated that different
depth cues are suited for recovering different kinds of depth
representation. For instance, motion and stereo cues are
usually used to recover point depth values, whereas texture
and shading cues are more suited for recovering second
order depth properties such as curvatures. There have
indeed been a number of computational works which
exploited the complementary strengths of different depth
cues by using a proper fusion scheme [3,4,10,19]. There has
also been a significant increase in the number of
psychophysical studies that examine fusion strategies
utilized by human observers [3,11,21,25].
Second order shape properties such as concavity are
important for many perceptual tasks. While it is well
known that the recovery of second order shape using
motion cues is sensitive to errors, are there certain types of
second order shapes that are more amenable to recovery?
How robust are the recovered shapes with respect to the
motion types? If the confidence level of the shape recovered
from motion can be ascertained in detail against various
motion-scene configurations, it then becomes possible to
adopt a proper fusion mechanism to combine the depths
recovered from different cues to produce better results.
0262-8856/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.imavis.2004.02.005
Image and Vision Computing 22 (2004) 807–817
www.elsevier.com/locate/imavis
* Corresponding author. Tel.: +65-6874-2990; fax: +65-6779-1103.
E-mail addresses: [email protected] (L.-F. Cheong), [email protected] (T. Xiang).
¹ An ordinal depth representation is a shape representation which only allows us to order the values of the scene depths.
In the past, the sensitivity of depth recovery from
structure from motion (SFM) algorithms has been
investigated from a statistical perspective, in particular,
the effects of noise [1,20,22]. However, given that true
motion estimates cannot be recovered, we expect systematic
distortion also to be present in the recovered structure. Not
many researchers have addressed explicitly the systematic
nature of the depth distortion given some errors in the 3D
motion estimates. Our investigation seeks to clarify the
impact such motion errors have on second order shape
estimates. In addition, we attempt to understand the severity
of the resultant shape distortion under different motion-
scene configurations. Are there generic motion types that
can render shape recovery more robust and reliable? As was
shown in Refs. [6,23] using the iso-distortion framework,
there exists a certain dichotomy between forward (perpendicular
to the image plane) and lateral (in the image plane)
motion in terms of both 3D motion and point depth
recovery. For instance, useful information like depth order
is preserved under a lateral motion in a small field of view,
even though it is difficult to disambiguate translation and
rotation in such a case; under forward motion, in contrast, a robust
recovery of 3D structure is much more difficult. Evidently,
not all motions are equal in terms of robust depth recovery
and these results have implications for the kinds of active
vision strategy to be adopted for improving depth estimates.
In this paper, the iso-distortion framework [5] is again
employed to analyze the errors in second order shape
recovery. Our main contribution lies in the elucidation of
the impact of errors in 3D motion estimates on second order
shape descriptors such as normal curvatures and shape index
[14]. In particular, our findings show that the second order
shape is recovered with varying degrees of uncertainty
depending on the types of 3D motion executed. Further-
more, we also make clear that different shapes exhibit
different sensitivities in their recovery to errors in 3D
motion estimates. These results flesh out the intuitive
feeling that various SFM algorithms behave differently
under different motion-scene configurations; this in turn
might permit a better fusion strategy with other shape cues.
Experiments are conducted to verify various theoretical
results obtained in this paper.
2. Background
2.1. Local shape representation
Consider the canonical form of a smooth (quadratic)
surface patch:

$$Z = \frac{1}{2}(k_{\min} X^2 + k_{\max} Y^2) + d \quad (1)$$

where k_min, k_max are the two principal curvatures with
k_min < k_max. This represents a canonical case whereby the
observer is fixating straight ahead at the surface patch
located at a distance d away, and the X-axis and the Y-axis
are aligned with the principal directions (directions of
principal curvature). Rotating the aforementioned surface
patch around the Z-axis by θ in the anti-clockwise direction,
we obtain a more general surface patch whose principal
directions do not coincide with the X–Y frame:

$$Z = \frac{1}{2}\left(\cos^2\theta\,(k_{\min} X^2 + k_{\max} Y^2) + \sin^2\theta\,(k_{\max} X^2 + k_{\min} Y^2)\right) + \cos\theta \sin\theta\,(k_{\min} - k_{\max})XY + d \quad (2)$$

Since the focus of this paper is on the recovery of local
shape, our analysis will primarily center around the fixation
point (X = Y = 0). The idea is that if a perceptual task
requires local shape information, the visual system is usually
fixated at that point.
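As a quick illustration (our own helper, not from the paper), the rotated patch of Eq. (2) can be generated numerically from the two principal curvatures and the rotation angle θ:

```python
import math

def quadric_patch(X, Y, k_min, k_max, theta, d):
    """Depth Z of the quadric patch of Eq. (2): the canonical patch of
    Eq. (1) rotated by theta about the Z-axis, so that the principal
    directions no longer coincide with the X-Y frame."""
    c, s = math.cos(theta), math.sin(theta)
    return (0.5 * (c * c * (k_min * X * X + k_max * Y * Y)
                   + s * s * (k_max * X * X + k_min * Y * Y))
            + c * s * (k_min - k_max) * X * Y + d)
```

Setting theta = 0 recovers Eq. (1), and theta = π/2 simply swaps the roles of the two principal curvatures.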
Local representations of curved surfaces include the
classical differential invariants of Gaussian and mean
curvatures which are computed based on the principal
curvatures. A good shape descriptor should correspond to
our intuitive idea of shape: shape is invariant under
translation and rotation, and more importantly, independent
of the scaling operation. Principal curvatures, as well as the
Gaussian and mean curvatures, satisfy the former but not
the latter condition, because they still carry information
about the amount of curvature. Koenderink [14]
proposed two measures of local shape, the shape index (S) and
the curvedness (C), as alternatives to the classical differential
shape invariants. S and C are defined, respectively, as
follows:

$$S = \frac{2}{\pi}\arctan\frac{k_{\min} + k_{\max}}{k_{\min} - k_{\max}} \quad (3)$$

$$C = \sqrt{\frac{k_{\min}^2 + k_{\max}^2}{2}} \quad (4)$$

S is a number in the range [−1, +1] and is obviously scale
invariant; C is a positive number with unit m⁻¹.
Shape index and curvedness provide us with a description of
3D quadratic surfaces in terms of their type of shape and
amount of curvature. Fig. 1 shows examples of quadratic
surfaces with the corresponding shape index values.
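A minimal sketch of Eqs. (3) and (4) (the function names are ours); the umbilic case k_min = k_max, where the arctan argument diverges, is handled explicitly:

```python
import math

def shape_index(k_min, k_max):
    """Shape index S of Eq. (3); S lies in [-1, +1] and is scale invariant."""
    if k_min == k_max:          # umbilic patch: the arctan argument diverges
        return math.copysign(1.0, k_min)
    return (2.0 / math.pi) * math.atan((k_min + k_max) / (k_min - k_max))

def curvedness(k_min, k_max):
    """Curvedness C of Eq. (4); a positive number with unit m^-1."""
    return math.sqrt((k_min * k_min + k_max * k_max) / 2.0)
```

A saddle with opposite curvatures maps to S = 0, a cylinder to S = ±0.5, and scaling both curvatures changes C but leaves S untouched.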
When shape is reconstructed based on the motion cue, it
is more appropriate to use shape index as the local
measurement of shape type rather than other differential
invariants. Due to the well-known ambiguity of SFM, the
scale of the recovered objects and the speed of translation
can only be determined up to a common factor. The
recovered scale information is thus meaningless unless
additional information is available. With shape index, we
know that the accuracy with which it can be estimated
depends on the accuracy of the 3D motion estimates.
2.2. General nature of distortion
This section derives the distortion in the recovered shape
given that the motion parameters are imprecise. In the most
general case, if the observer is moving rigidly with respect
to its 3D world with a translation (U, V, W) and a rotation
(α, β, γ), the resulting optical flow (u, v) at an image
location (x, y) can be expressed by the well-known
equations [15]:

$$u = u_{trans} + u_{rot} = (x - x_0)\frac{W}{Z} + \frac{\alpha xy}{f} - \beta\left(\frac{x^2}{f} + f\right) + \gamma y$$
$$v = v_{trans} + v_{rot} = (y - y_0)\frac{W}{Z} + \alpha\left(\frac{y^2}{f} + f\right) - \frac{\beta xy}{f} - \gamma x \quad (5)$$

where (x_0, y_0) = (fU/W, fV/W) is the focus of expansion (FOE),
Z is the depth of a scene point, u_trans, v_trans are the horizontal
and vertical components of the flow due to translation, and
u_rot, v_rot the horizontal and vertical components of the flow
due to rotation, respectively.
Since the depth can only be derived up to a scale factor,
we set W = 1 for the case of general motion; the case of
pure lateral motion (W = 0) will be discussed separately
where required. The recovered scaled depth of a scene point
can then be written as:

$$Z = \frac{(x - x_0,\; y - y_0)\cdot n}{(u - u_{rot},\; v - v_{rot})\cdot n}$$

where n is a unit vector which specifies a direction.

If there are errors in the estimation of the intrinsic
or extrinsic parameters, these will in turn cause errors in the
estimation of the scaled depth, and thus a distorted version
of the space will be computed. In this paper, we denote the
estimated parameters with the hat symbol (^) and errors in
the estimated parameters with the subscript e (where the
error of any estimate p is defined as p_e = p − p̂). The
estimated depth Ẑ can readily be shown to be related to
the actual depth Z as follows:

$$\hat{Z} = Z\left(\frac{(x - \hat{x}_0,\; y - \hat{y}_0)\cdot \hat{n}}{(x - x_0,\; y - y_0)\cdot \hat{n} + (u_{rot_e},\; v_{rot_e})\cdot \hat{n}\, Z}\right) \quad (6)$$

From Eq. (6) we can see that Ẑ is obtained from Z through
multiplication by the factor inside the brackets, which we
denote by D and call the iso-distortion factor. The expression
for D contains the term n̂, whose value depends on the scheme
we use to recover depth. In the forthcoming analysis, we will
adopt the 'epipolar reconstruction' scheme [5], i.e. n̂ is along
the estimated epipolar direction with

$$\hat{n} = \frac{(x - \hat{x}_0,\; y - \hat{y}_0)}{\sqrt{(x - \hat{x}_0)^2 + (y - \hat{y}_0)^2}}$$

based on the intuition that the epipolar direction contains the
strongest translational flow. The detailed statistical and
geometrical reasons for choosing this reconstruction scheme
were explained in Ref. [6].
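A sketch of the scheme (the helper name is ours): project the derotated flow onto the estimated epipolar direction and solve for the scaled depth, assuming W is normalized to 1 as in the text:

```python
import math

def epipolar_depth(x, y, u, v, f, U_hat, V_hat, a_hat, b_hat, g_hat):
    """Scaled depth via 'epipolar reconstruction' (W normalized to 1):
    Z = ((x - x0, y - y0) . n) / ((u - u_rot, v - v_rot) . n),
    with n along the estimated epipolar direction."""
    x0, y0 = f * U_hat, f * V_hat          # estimated FOE (with W_hat = 1)
    ex, ey = x - x0, y - y0
    norm = math.hypot(ex, ey)
    nx, ny = ex / norm, ey / norm          # estimated epipolar direction n
    # rotational flow predicted by the estimated rotation (cf. Eq. (5))
    u_rot = a_hat * x * y / f - b_hat * (x * x / f + f) + g_hat * y
    v_rot = a_hat * (y * y / f + f) - b_hat * x * y / f - g_hat * x
    return (ex * nx + ey * ny) / ((u - u_rot) * nx + (v - v_rot) * ny)
```

With error-free motion estimates this returns the true scaled depth; plugging in biased estimates reproduces the distorted depth of Eq. (6).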
If we denote, with slight abuse of notation, the
homogeneous co-ordinates of a point P³ by (X, Y, Z, W),
and those of the estimated position P̂³ by (X̂, Ŷ, Ẑ, Ŵ), the
distortion transformation f: P³ → P̂³ can be written down.
Note that to obtain the estimated X̂, we use the back-projection
given by X̂ = xẐ/f = D·xZ/f = DX; similarly, Ŷ = DY. In general, the
transformation from physical to perceptual space belongs to
the family of Cremona transformations [5]. The resulting
transformation f is more complicated than a projective
transformation, and few statements can be made about
the nature of the distorted shape. Indeed, some readers may
observe that at certain points the distortion D is not defined.
These points are known as the fundamental elements of the
Cremona transformation and indeed characterize properties
of the transformation. Despite the complex nature of the
distortion, the recovered depths may nevertheless possess
good properties under certain types of generic motions. For
instance, ordinal depth information is preserved under
lateral motion, even though it is difficult to disambiguate the
translation and the rotation in such a case [6,23].
Fig. 1. Examples of quadratic surfaces with the corresponding shape index values. S = ±1: umbilic shapes (spherical cap and cup); S = ±0.5: cylindrical shapes; S ∈ [−1.0, −0.5] or S ∈ [0.5, 1.0]: elliptic shapes; S ∈ [−0.5, 0.5]: hyperbolic shapes.

Many biological organisms often judge structure by
making lateral motions, although the lateral translations are
usually coupled with a certain amount of rotation. For an
artificial active vision system, forward and lateral motions
are often purposefully executed for various tasks such as
depth recovery, which means that the agent executing such
motions is at least aware of the generic type of motion being
executed. Thus it is reasonable to assume the following:
$$\begin{cases} \hat{W} = W = 0 & \text{when lateral motion is executed} \\ \hat{U} = U = \hat{V} = V = 0 & \text{when forward motion is executed} \end{cases} \quad (7)$$

Furthermore, rotation around the optical axis is usually
not executed by an active visual system. Thus we can
further assume that γ̂ = γ = 0.
3. Distortion of shape recovery under generic motions
3.1. Lateral motion
We consider an agent performing a lateral translational
motion (U, V, 0), coupled with a rotation (α, β, γ). Under the
aforementioned assumption of W = Ŵ = γ = γ̂ = 0, the
estimated epipolar direction is a fixed direction for
all the feature points, with

$$\hat{n} = \frac{(\hat{U}, \hat{V})}{\sqrt{\hat{U}^2 + \hat{V}^2}}$$

Without loss of generality, we can set the estimated epipolar
direction to be along the positive X-axis, so we
have V̂ = 0 (though V is not necessarily zero). We can
easily show that for any other direction, the distortion
expression has an identical form with only changes in the
values of some parameters. Modifying Eq. (6) to
take into account W = 0, the distortion factor can readily
be shown to be Eq. (8), from which the mapping from a point
(X, Y, Z) in the physical space to a point (X̂, Ŷ, Ẑ) in the
reconstructed space can be established.
Consider that a surface s(X, Y) = (X, Y, Z(X, Y)), parameterized²
by X and Y, is transformed under the mapping
to the reconstructed surface ŝ(X̂, Ŷ) = (X̂, Ŷ, Ẑ(X̂, Ŷ)), which
is parameterized by X̂ and Ŷ. Using Eq. (8), the perceived
surface can also be parameterized by X and Y as follows:

$$\hat{s}(X, Y) = (DX,\; DY,\; DZ) \quad (9)$$

where D is defined in Eq. (8) and, for brevity, Z(X, Y) has
been written as Z. Given this parameterization,
Mathematica³ can be used to perform the algebraic
manipulations to obtain expressions for quantities such
as k̂_min and k̂_max, as well as to visualize the distortion of the
local shape in the following sections.
3.1.1. Curvatures
For the local surface patch given by Eq. (2), we obtain
the perceived principal curvatures k̂_min and k̂_max at the
fixation point as follows:

$$\hat{k}_{\min} = \frac{1}{2}\left(\frac{(k_{\min} + k_{\max})U - 2\beta_e}{\hat{U}} - \sqrt{\frac{Q}{\hat{U}^2}}\right)$$
$$\hat{k}_{\max} = \frac{1}{2}\left(\frac{(k_{\min} + k_{\max})U - 2\beta_e}{\hat{U}} + \sqrt{\frac{Q}{\hat{U}^2}}\right) \quad (10)$$

where Q = U²(k_min − k_max)² + 4(α_e² + β_e²) − 4U(k_min − k_max)(β_e cos 2θ + α_e sin 2θ).
Notice that the parameter d, which
indicates the distance between the surface patch and the
observer, does not affect the distorted principal curvatures
(or other shape measures based on the principal curvatures).
From Eq. (10), it is obvious that the Gaussian and mean
curvatures are not preserved under the distortion; even the
signs of the principal curvatures are not necessarily preserved.
The principal directions of the perceived surface patch can
be obtained by computing the eigenvectors of the Weingarten
matrix [8] of the parametric shape representation. It
was found that generally the principal directions are not
preserved. Only when the principal directions are aligned
with the X- and Y-axes and α_e = 0 are they preserved,
though the directions of the maximum curvature and the
minimum curvature may swap.
We are also interested in the distortion of the normal
curvatures, especially those along the
horizontal and vertical directions, which are denoted
Z_XX and Z_YY, respectively, and given by:

$$Z_{XX} = k_{\min}\cos^2\theta + k_{\max}\sin^2\theta$$
$$Z_{YY} = k_{\min}\sin^2\theta + k_{\max}\cos^2\theta \quad (11)$$

The distorted normal curvatures along the horizontal and
vertical directions can be obtained as:

$$\hat{Z}_{XX} = \frac{U}{\hat{U}} Z_{XX} - \frac{2\beta_e}{\hat{U}}$$
$$\hat{Z}_{YY} = \frac{U}{\hat{U}} Z_{YY} \quad (12)$$

Interestingly, Ẑ_XX and Ẑ_YY are not affected by α_e. Eq. (12)
shows that normal curvatures along different directions have
different distortion properties.
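Eqs. (10)-(12) translate directly into code (a sketch with our own naming; with error-free estimates they reduce to the true curvatures):

```python
import math

def distorted_principal_curvatures(k_min, k_max, theta, U, U_hat, a_e, b_e):
    """Perceived principal curvatures of Eq. (10) at the fixation point,
    for lateral motion with the estimated translation along the X-axis."""
    Q = (U**2 * (k_min - k_max)**2 + 4.0 * (a_e**2 + b_e**2)
         - 4.0 * U * (k_min - k_max)
           * (b_e * math.cos(2 * theta) + a_e * math.sin(2 * theta)))
    mean2 = ((k_min + k_max) * U - 2.0 * b_e) / U_hat   # k_min_hat + k_max_hat
    root = math.sqrt(Q / U_hat**2)
    return 0.5 * (mean2 - root), 0.5 * (mean2 + root)

def distorted_normal_curvatures(Zxx, Zyy, U, U_hat, b_e):
    """Distorted normal curvatures of Eq. (12); only Z_XX picks up
    the beta_e bias term."""
    return (U / U_hat) * Zxx - 2.0 * b_e / U_hat, (U / U_hat) * Zyy
```

Note how a plane (Z_XX = Z_YY = 0) acquires a nonzero Ẑ_XX, but not Ẑ_YY, whenever β_e ≠ 0: the anisotropy discussed above.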
$$D = \frac{(\hat{U}^2 + \hat{V}^2)Z}{(U\hat{U} + V\hat{V})Z + \hat{U}\beta_e(X^2 + Z^2) - \hat{V}\alpha_e(Y^2 + Z^2) + (\hat{V}\beta_e - \hat{U}\alpha_e)XY} \quad (8)$$

² Such a parametric representation may not be possible for more complex global shapes.
³ Mathematica is a registered trademark of Wolfram Research, Inc.
3.1.2. Shape index
From the expressions for the distorted principal curvatures,
the distorted shape index at the fixation point can readily be
obtained by substituting Eq. (10) into Eq. (3):

$$\hat{S} = \frac{2}{\pi}\arctan\left(-\frac{(k_{\min} + k_{\max})U - 2\beta_e}{\hat{U}\sqrt{Q/\hat{U}^2}}\right) \quad (13)$$

As the expression for Ŝ is very complex, we have assembled
some diagrams to graphically illustrate the error in the shape
index estimate with respect to the true shape index. The
curvedness is fixed for each diagram so as to factor out its
influence. Fig. 2 shows the distortion obtained
by varying the principal directions while fixing all the other
parameters. Fig. 2(a)-(c) show the sensitivity of shape
recovery to the motion estimation errors for three principal
directions: θ = 0, θ = π/4 and θ = π/2. Fig. 2(d) shows the
average sensitivity over different principal directions. Note
that the sudden jump of S_e in some of the curves corresponds
to the case where the directions of the maximum and
minimum curvature have swapped. It can
also be seen that the principal direction does affect the
robustness of the shape perception. We can also infer from
Fig. 2 that different shapes have different distortion
properties. The average curve shown in Fig. 2(d) gives
a rough idea of the overall robustness of the perception of
different shapes: the saddle-like shapes (also known as
'hyperbolic shapes', with k_min and k_max having different
signs) are more sensitive to the errors in 3D motion
estimation than the concave and convex shapes ('elliptic
shapes' or 'umbilic shapes', with k_min and k_max having the
same sign).
Other factors that may have an impact on the sensitivity
of shape perception are studied in Fig. 3. β_e has a
significant influence on the shape perception. Comparing
Fig. 3(a) with Fig. 2(a), we can see that the error in the
estimated shape index increases with increasing |β_e|.
The sign of β_e determines whether the shape index
is under-estimated or over-estimated for each shape
type (compare Fig. 3(a) and (b)). On the contrary, α_e
seems to have little influence on the recovered shape index,
as shown in Fig. 3(c). This anisotropy between α_e and β_e is
related to the directional anisotropy that exists in the
distortion of the normal curvatures, derived earlier in
Eq. (12), and is of course a result of the estimated
translation being in the horizontal direction. Finally, our
analysis supports the view that more curved
surfaces will be perceived with higher accuracy (see
Fig. 3(d)).
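To probe this numerically, Eq. (13) can be evaluated by composing Eq. (10) with Eq. (3) (a sketch under the parameter values of Fig. 2; the helper names are ours). Fixing C and sweeping the true shape index reproduces curves like those of Fig. 2:

```python
import math

def shape_index(k_min, k_max):
    """Shape index S of Eq. (3), with the umbilic limit handled."""
    if k_min == k_max:
        return math.copysign(1.0, k_min)
    return (2.0 / math.pi) * math.atan((k_min + k_max) / (k_min - k_max))

def distorted_shape_index(k_min, k_max, theta, U, U_hat, a_e, b_e):
    """Eq. (13) via Eq. (10): shape index of the perceived patch."""
    Q = (U**2 * (k_min - k_max)**2 + 4.0 * (a_e**2 + b_e**2)
         - 4.0 * U * (k_min - k_max)
           * (b_e * math.cos(2 * theta) + a_e * math.sin(2 * theta)))
    mean2 = ((k_min + k_max) * U - 2.0 * b_e) / U_hat
    root = math.sqrt(Q / U_hat**2)
    return shape_index(0.5 * (mean2 - root), 0.5 * (mean2 + root))
```

With zero motion error the distorted index coincides with the true one; plotting S_e = Ŝ − S against S with α_e = 0.005, β_e = −0.02, U = 0.2, Û = 0.18 (the settings quoted for Fig. 2) gives sensitivity curves of the kind shown there.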
3.2. Distortion under forward motion

In this section, we assume pure forward translation is
performed. According to Eq. (7), we have Û = U = V̂ = V = 0.
We also assume γ̂ = γ = 0, as before.

Fig. 2. Distortion of the shape index for different principal directions, plotted as S_e against S: θ = 0 for (a), θ = π/4 for (b) and θ = π/2 for (c); (d) was obtained by averaging the curves with θ = nπ/8, where n = 0, 1, …, 8. All the other parameters are identical: α_e = 0.005, β_e = −0.02, U = 0.2 and Û = 0.18.

Adopting the 'epipolar reconstruction' approach, the distortion
factor can be expressed as Eq. (14).
Following the same procedure as for the lateral motion
case, we can obtain the distorted principal and normal
curvature expressions at any point of the distorted surface.
The distorted shape index can also be computed thereafter.
These expressions are very complex and are thus not
presented here. Note that at the fixation point, the distorted
local shape measurements are undefined, because the
distortion factor is undefined there (see Eq. (14)). Our
analysis in Ref. [6] shows that the distortion factor varies
wildly around the fixation point, which implies that the
distorted surface patches are no longer smooth. Therefore,
the distorted local shape measurements at any particular
point do not carry much meaning.
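Eq. (14) is easy to evaluate away from the fixation point (an illustrative sketch; with zero motion error D = 1 everywhere except the undefined origin, while a small β_e already makes D swing wildly right next to X = Y = 0):

```python
def forward_distortion_factor(X, Y, Z, a_e, b_e):
    """Iso-distortion factor D of Eq. (14) for forward motion
    (U = U_hat = V = V_hat = 0). At the fixation point X = Y = 0 both
    numerator and denominator vanish, so D is undefined there."""
    num = X * X + Y * Y
    den = (X * X + Y * Y
           + a_e * Z * (X * X * Y + Y * Z * Z + Y**3)
           - b_e * Z * (X * Y * Y + X * Z * Z + X**3))
    return num / den
```

For example, with β_e = 0.001 and Z = 5, a point only 0.01 off the fixation point along X already sees D change sign, while the corresponding point along Y is unaffected, illustrating the wild variation around the region of attention.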
3.3. Graphical illustration
To get an idea of how curved surface patches are
distorted under different generic motions, we use Mathematica
to obtain the distorted shapes and display them graphically.
Fig. 4 compares the original and
recovered surfaces under lateral and forward motions. We
consider a vertical cylinder (upper row) and a horizontal
cylinder (lower row), given by the equations Z = ½X² + 4
and Z = ½Y² + 4, respectively. For the lateral motion case
and the parameters stated in the caption of Fig. 4, it is
easy to show from Eqs. (10) and (12) that at the fixation
point the vertical cylinder is perceived as a less curved
vertical cylinder and the horizontal cylinder is perceived
as a saddle-like shape. The principal directions remain
unchanged. Fig. 4(b) and (e) show the full reconstructed
surfaces over the entire cylinders; clearly the distorted
surface patches are smooth, and the distortion behaviour at
the fixation point seems to be qualitatively representative of
the global shape distortion even for a large viewing angle. The
recovered surface patches under forward motion, by
contrast, are not smooth and show large distortion. Even
though the errors in the rotation estimates are much smaller
than those in the lateral motion case, the perceived
surface patches are obviously not quadrics, with major
distortion occurring around the central region of attention.
4. Experiments
In this section, we present computational experiments
on computer-generated and real images to further verify the
theoretical predictions.
Fig. 3. Other factors that affect the distortion of the shape index (see text). β_e = −0.03 for (a), β_e = 0.02 for (b) and β_e = −0.02 for (c) and (d); α_e = −0.05 for (c) and α_e = 0.005 for (a), (b) and (d); C = 20 for (d) and C = 10 for (a), (b) and (c). θ = 0, U = 0.20 and Û = 0.18 for all the diagrams.

$$D = \frac{X^2 + Y^2}{X^2 + Y^2 + \alpha_e Z(X^2 Y + YZ^2 + Y^3) - \beta_e Z(XY^2 + XZ^2 + X^3)} \quad (14)$$
4.1. Computer generated image sequences
SOFA⁴ is a package of nine computer-generated image
sequences designed for testing research work in motion
analysis. It includes full ground truth on the motion and
camera parameters. Sequences 1 and 5 (henceforth abbreviated
as SOFA1 and SOFA5) were chosen for our
experiments, the former depicting a lateral motion and the
latter a forward motion. Both have an image
dimension of 256 × 256 pixels, a focal length of 309 pixels
and a field of view of approximately 45°. The focal length
and principal point of the camera were fixed for the whole
sequence. Optical flow was obtained using Lucas and
Kanade's method [16], with a temporal window of 15
frames. Depth was recovered for frame 9. We assumed all
the intrinsic parameters of the camera were estimated
accurately throughout the experiments and concentrated on
the impact of errors in the extrinsic parameters.
The 3D scene for SOFA1 consisted of a cube resting on a
cylinder (Fig. 5(a)). Clearly, all the feature points on the
top face of the cylinder in SOFA1 (which is delineated in
Fig. 5(a)) lie on a plane. The reconstructed shape of this
plane was used to test our theoretical predictions for the
lateral motion case. The camera trajectory for SOFA1 was a
circular route on a plane perpendicular to the world Y-axis,
with constant translational parameters (U, V, W) =
(0.8137, 0.5812, 0) and constant rotational parameters
(α, β, γ) = (−0.0203, 0.0284, 0). For the reasons discussed
in Section 2, we assume that current SFM algorithms
can estimate W accurately (i.e. Ŵ = 0). Given the 3D
motion estimates, depth is estimated as:

$$\hat{Z} = \frac{-f(\hat{U}, \hat{V})\cdot \hat{n}}{(u, v)\cdot \hat{n} - (\hat{u}_{rot}, \hat{v}_{rot})\cdot \hat{n}} \quad (15)$$

Different SFM algorithms may give different errors in the
estimated 3D motion parameters. Instead of being constrained
by one specific SFM algorithm, let us consider all the
possible configurations of the 3D motion estimation errors.⁵
The resultant recovered depths are illustrated in Fig. 5 using
3D plots viewed from the side.

Fig. 4. Distortion of cylinders under lateral and forward motion: (a) and (d) are the original vertical and horizontal cylinders, respectively; (b) and (e) are the distorted vertical and horizontal cylinders under lateral motion, respectively, with U = 0.9, Û = 1.0, V = 0.5 and V̂ = 0; (c) and (f) are the distorted vertical and horizontal cylinders under forward motion, respectively. The errors in the estimated rotation are: α_e = 0 and β_e = 0.05 for (b) and (e), and α_e = 0 and β_e = 0.001 for (c) and (f). The field of view is 127° for all the diagrams.

⁴ Courtesy of the Computer Vision Group, Heriot-Watt University (http://www.cee.hw.ac.uk/~mtc/sofa).
⁵ For our experiments, if the subspace algorithm proposed by Heeger and Jepson [12] was used, we obtained the reconstruction shown in Fig. 5(c); if the algorithm proposed by Ma et al. [17] was adopted, we obtained the reconstruction shown in Fig. 5(d).

Fig. 5. Computer-generated lateral motion sequence and shapes recovered when different 3D motion estimates were used for reconstruction. (a) SOFA1 frame 9 with the top face of the cylinder delineated; (b) reconstruction with the true motion parameters; (c) reconstruction with (Û, V̂) = (−1, 0) and (α̂, β̂, γ̂) = (−0.03, 0.01, 0); (d) reconstruction with (Û, V̂) = (−1, 0) and (α̂, β̂, γ̂) = (−0.01, 0.04, 0); (e) reconstruction with (Û, V̂) = (−2, 0) and (α̂, β̂, γ̂) = (−0.01, 0.04, 0); (f) reconstruction with (Û, V̂) = (−1, 0) and (α̂, β̂, γ̂) = (−0.01, 0.06, 0).

Fig. 6. Computer-generated forward motion sequence and shapes recovered when different 3D motion estimates were used for reconstruction. (a) SOFA5 frame 9. (b) Reconstruction with the true motion parameters. (c) Reconstruction with (α̂, β̂, γ̂) = (−0.001, −0.001, 0). (d) Reconstruction with (α̂, β̂, γ̂) = (0.001, 0.001, 0). (e) Reconstruction with (α̂, β̂, γ̂) = (−0.001, 0.001, 0). (f) Reconstruction with (α̂, β̂, γ̂) = (0.001, −0.001, 0).

In particular, Fig. 5(b) shows
that, with the motion parameters accurately estimated, the plane
(top of the cylinder) remained as a plane, although the noise
in the optical flow estimates made some points ‘run away’
from the plane. Fig. 5(c) depicts the shape recovered when
β_e > 0 and Û < 0. According to Eq. (12), we then have Ẑ_XX > 0.
It can be seen from Fig. 5(c) that the plane was indeed
reconstructed as a convex surface. Conversely, when β_e < 0
and Û < 0, concave surfaces were perceived (Fig. 5(d)-(f)).
Comparing Fig. 5(d) with Fig. 5(e), we find that
a larger |Û| in general resulted in a smaller curvature distortion,
whereas comparing Fig. 5(d) with Fig. 5(f) shows that a larger
|β_e| resulted in a larger curvature distortion. Fig. 5(c)-(f) show
that the reconstructed surfaces were not curved in the Y
direction. All these results conform to the predictions made in
Section 3.
The SOFA5 sequence was used in the next set of
experiments to verify the predictions for the case of forward
motion. The 3D scene for SOFA5 comprised a pile of four
cylinders stacked upon each other in front of a fronto-parallel
background (Fig. 6(a)). The camera trajectory
for SOFA5 was parallel to the world Z-axis, and
the corresponding translational and rotational parameters
were (U, V, W) = (0, 0, 1) and (α, β, γ) = (0, 0, 0), respectively.
Current SFM algorithms have no difficulty in
estimating the translation accurately for the SOFA5
sequence [23]. Therefore, the equation for reconstructing
the depth of each feature point is:

$$\hat{Z} = \frac{x^2 + y^2}{(u, v)\cdot(x, y) - (\hat{u}_{rot}, \hat{v}_{rot})\cdot(x, y)}$$

The resultant recovered depths, given different typical errors
in the 3D motion estimates, are shown in Fig. 6. Fig. 6(b)
depicts the case of no errors in the motion parameters. It can
be seen that the background plane was roughly preserved.
However, when there was a small amount of error in the
rotation estimates, a complicated curved surface with
significant distortion was reconstructed (shown in
Fig. 6(c)). This is in accordance with our prediction that
large depth distortion is expected when forward motion is
performed.
Fig. 7. Real lateral motion sequence and shapes recovered when different 3D motion estimates were used for reconstruction. (a) BASKET frame 9. (b) Reconstruction with (Û, V̂) = (−1, 0) and (α̂, β̂, γ̂) = (0.05, 0.05, 0). (c) Reconstruction with (Û, V̂) = (−1, 0) and (α̂, β̂, γ̂) = (0.05, 0.1, 0). (d) Reconstruction with (Û, V̂) = (−0.5, 0) and (α̂, β̂, γ̂) = (0.05, 0.05, 0).
4.2. Real image sequence
Shape recovery was also performed on a real image
sequence, BASKET.⁶ It was taken by a stationary video
camera viewing an upturned basket being rotated on an office
chair (Fig. 7(a)). The true intrinsic and extrinsic parameters
are not available. However, we know that the basket was
rotating about a vertical axis which was approximately
parallel to the Y-axis of the camera co-ordinate system and
located along the Z-axis of the camera co-ordinate system.
The equivalent egomotion can thus be expressed as
(U, V, W) = (β₀Z₀, 0, 0) and (α, β, γ) = (0, −β₀, 0), where
β₀ is the rotational velocity of the basket (β₀ > 0 in this
case) and Z₀ the distance between the optical center of the
camera and the centroid of the basket. Depth was recovered
for frame 9 using Eq. (15). Notice that since the camera was
not calibrated, we arbitrarily set the value of the focal
length to f = 300. Fortunately, we have proven in Ref. [6]
that under lateral motion the errors in the intrinsic
parameters do not affect the depth recovery results
qualitatively. The shapes recovered, given different errors
in the 3D motion estimates, are shown in Fig. 7. Fig. 7(b)
depicts the shape recovered when Û < 0 and β_e < 0
(β_e < 0 since β < 0 and β̂ > 0). Substituting the signs of
these terms into Eq. (12), we obtain Ẑ_XX < Z_XX, which
means that the recovered surface should be less convex than
the true shape. Comparing the result of Fig. 7(b) with the
true shape, which we measured manually, this is indeed the case.
Furthermore, according to Eq. (12), a smaller β_e or a larger
Û (with the signs of these terms unchanged) would result in
an even less convex shape being recovered. Fig. 7(c) and (d)
clearly support this prediction.
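The qualitative argument above can be checked numerically. The sketch below uses the standard calibrated-camera flow equations for lateral motion (translation U along X, rotation β about Y), not the paper's exact Eqs. (12) and (15); the focal length f = 300 matches the experiment, but the surface, the motion values, and the error convention β_e = β − β̂ are illustrative assumptions.

```python
import numpy as np

f = 300.0                          # focal length in pixels, as in the experiment
x = np.array([-10.0, 0.0, 10.0])   # three image points along the x-axis
Z0, c = 10.0, 5e-4                 # assumed convex true surface: centre nearer
Z = Z0 + c * x**2                  # than the rim, so Z_xx = 2c > 0

def recover_depth(U_true, beta_true, U_est, beta_est):
    # horizontal flow induced by lateral translation U and rotation beta about Y
    u = -f * U_true / Z - beta_true * (f + x**2 / f)
    # invert the same flow model using the (erroneous) motion estimates
    return -f * U_est / (u + beta_est * (f + x**2 / f))

def curvature(depth):
    # discrete second difference, a proxy for Z_xx at x = 0
    return (depth[0] + depth[2] - 2.0 * depth[1]) / 10.0**2

curv_true = curvature(Z)                                    # = 2c = 0.001
curv_b = curvature(recover_depth(-1, -0.02, -1, 0.05))      # Fig. 7(b)-like case
curv_c = curvature(recover_depth(-1, -0.02, -1, 0.10))      # Fig. 7(c)-like case

# With beta_e = beta - beta_est < 0, the recovered curvature shrinks, and a
# larger beta_est (smaller beta_e) shrinks it further: less convex still.
print(curv_c < curv_b < curv_true)   # prints True
```

The recovered surface stays convex but with reduced curvature, matching the predicted flattening of Fig. 7(b) and (c).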
5. Conclusions
This paper presented a geometric investigation into the
reliability of shape recovery from the motion cue given errors in
the estimated motion parameters. Distortion in the recovered local
shape was considered under different generic
motions. Specifically, our distortion model has shown that
the type of motion executed is critical for the accuracy of
the recovered shape. In the case of lateral motion, the
accuracy of curvature estimates exhibits anisotropy with
respect to the estimated translation direction, in that the
surface curvature along the estimated translational direction is
better perceived than that in the orthogonal direction.
Correspondingly, the recognition of different shapes (classified
in terms of shape index) exhibits different degrees of
sensitivity to noise. In the case of forward motion, the
reconstructed shape experiences larger distortion than in
the case of lateral motion and is no longer visually smooth.
These findings suggest that shape estimates are recovered
with varying degrees of uncertainty depending on the
motion-scene configuration. When designing a visual system
to perform shape recovery, these findings can help us choose
the best motion strategy. Moreover, if the confidence
level of the shape estimates under various motion-scene
configurations can be ascertained, the robustness of shape
perception can be enhanced by including additional static
cues to restrict the space of possible interpretations. The work
presented in this paper has taken a step in this direction.
More work needs to be done to understand the robustness of
shape recovery under more general configurations, and the
investigation should also be extended to other cues.
Practical and robust fusion algorithms will not be possible
until the robustness of the shape recovered from the various
cues is fully understood.
References
[1] G. Adiv, Inherent ambiguities in recovering 3-D motion and structure
from a noisy flow field, IEEE Trans PAMI 11 (5) (1989) 477–489.
[2] J. Aloimonos, I. Weiss, A. Bandopadhay, Active vision, Int J Comput
Vision 2 (1988) 333–356.
[3] H.H. Bulthoff, H.A. Mallot, Integration of depth modules: stereo and
shading, J Opt Soc Am A 5 (1988) 1749–1758.
[4] P. Cavanagh, Reconstructing the third dimension: interactions
between color, texture, motion, binocular disparity, and shape,
CVGIP 37 (2) (1987) 171–195.
[5] L.-F. Cheong, K. Ng, Geometry of distorted visual space and Cremona
transformation, Int J Comput Vision 32 (2) (1999) 195–212.
[6] L.-F. Cheong, T. Xiang, Characterizing depth distortion under
different generic motions, Int J Comput Vision 44 (3) (2001)
199–217.
[7] J.E. Cutting, P.M. Vishton, Perceiving layout and knowing distances:
the integration, relative potency, and contextual use of different
information about depth, in: W. Epstein, S. Rogers (Eds.), Perception
of space and motion, Academic Press, San Diego, 1995.
[8] A. Davies, P. Samuels, An introduction to computational geometry for
curves and surfaces, Clarendon Press, Oxford, 1996.
[9] O.D. Faugeras, Stratification of three-dimensional vision: projective,
affine, and metric representations, J Opt Soc Am A 12 (1995)
465–484.
[10] P.V. Fua, Y.G. Leclerc, Object-centered surface reconstruction:
combining multi-image stereo and shading, Int J Comput Vision 16 (1) (1995)
35–56.
[11] R.A. Jacobs, Optimal integration of texture and motion cues to depth,
Vision Res 39 (1999) 3621–3629.
[12] D.J. Heeger, A.D. Jepson, Subspace methods for recovering rigid
motion I: algorithm and implementation, Int J Comput Vision 7
(1992) 95–117.
[13] J.J. Koenderink, A.J. van Doorn, Affine structure from motion, J Optic
Soc Am 8 (2) (1991) 377–385.
[14] J.J. Koenderink, A.J. van Doorn, Surface shape and curvature scales,
Image Vision Comput 10 (8) (1992) 557–564.
[15] H.C. Longuet-Higgins, A computer algorithm for reconstruction of a
scene from two projections, Nature 293 (1981) 133–135.
[16] B.D. Lucas, Generalized image matching by the method of differences,
PhD Dissertation, Carnegie-Mellon University, 1984.
[17] Y. Ma, J. Kosecka, S. Sastry, Linear differential algorithm for motion
recovery: a geometric approach, Int J Comput Vision 36 (1) (2000)
71–89.
6 Courtesy of Dr Andrew Calway of Department of Computer Science,
University of Bristol.
[18] J. Oliensis, A critique of structure-from-motion algorithms, Comput
Vision Image Understand 80 (2000) 172–214.
[19] P.R. Schrater, D. Kersten, How optimal depth cue integration depends
on the task, Int J Comput Vision 40 (1) (2000) 71–89.
[20] J.I. Thomas, A. Hanson, J. Oliensis, Understanding noise: the critical
role of motion error in scene reconstruction, Proceedings of the
DARPA Image Understanding Workshop, 1993, pp. 691–695.
[21] J.S. Tittle, J.F. Norman, V.J. Perotti, F. Phillips, The perception of
scale-dependent and scale-independent surface structure from
binocular disparity, texture, and shading, Perception 26 (1997)
147–166.
[22] J. Weng, T.S. Huang, N. Ahuja, Motion and structure from image
sequences, Springer, Berlin, 1991.
[23] T. Xiang, L-F. Cheong, Understanding the behavior of structure from
motion algorithms: a geometric approach, Int J Comput Vision 51 (2)
(2003) 111–137.
[24] G.S. Young, R. Chellapa, Statistical analysis of inherent ambiguities
in recovering 3-D motion from a noisy flow field, IEEE Trans PAMI
14 (1992) 995–1013.
[25] M.J. Young, M.S. Landy, L.T. Maloney, A perturbation analysis of
depth perception from combinations of texture and motion cues,
Vision Res 33 (1993) 2685–2696.