
Plenoptic video geometry

Jan Neumann, Cornelia Fermüller

Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA
E-mail: {jneumann,fer}@cfar.umd.edu

Published online: 2 July 2003
© Springer-Verlag 2003
The Visual Computer (2003) 19:395–404
Digital Object Identifier (DOI) 10.1007/s00371-003-0203-5

More and more processing of visual information is nowadays done by computers, but the images captured by conventional cameras are still based on the pinhole principle inspired by our own eyes. This principle, though, is not necessarily the optimal image-formation principle for automated processing of visual information. Each camera samples the space of light rays according to some pattern. If we understand the structure of the space formed by the light rays passing through a volume of space, we can determine the camera, or in other words the sampling pattern of light rays, that is optimal with regard to a given task. In this work we analyze the differential structure of the space of time-varying light rays described by the plenoptic function and use this analysis to relate the rigid motion of an imaging device to the derivatives of the plenoptic function. The results can be used to define a hierarchy of camera models with respect to the structure from motion problem and formulate a linear, scene-independent estimation problem for the rigid motion of the sensor purely in terms of the captured images.

Key words: Multi-view geometry – Spatio-temporal image analysis – Camera design – Structure from motion – Polydioptric cameras

1 Introduction

When we think about vision, we usually think of interpreting the images taken by (two) eyes such as our own – that is, images acquired by camera-type eyes based on the pinhole principle. This is advantageous because it enables an easy interpretation of the visual information by a human observer. Despite the fact that many image interpretation tasks are automated nowadays, most commercially available cameras are still based on the same pinhole principle. One considers a point in space and the light rays passing through that point. Then the rays are cut with an imaging surface, and a subset of them forms an image. This image captures a view of the world that looks very similar to the view we capture with our own eyes, but this might not be the optimal image to solve a given task.

Of course these are not the only types of eyes that exist; the biological world reveals a large variety of eye designs. It has been estimated that eyes have evolved no fewer than forty times, independently, in diverse parts of the animal kingdom [8], and these eye designs, and therefore the images they capture, are highly adapted to the tasks the animal has to perform. This suggests that we should not just focus our efforts on designing algorithms that optimally process a given visual input, but also optimize the design of the imaging sensor with regard to the task at hand, so that the subsequent processing of the visual information is facilitated.

This focus on sensor design has already begun. Technological advances make it possible to construct integrated imaging devices using electronic and mechanical micro-assembly, micro-optics, and advanced data buses, not only of the kind that exist in nature such as log-polar retinas [5], but also of many other kinds such as catadioptric cameras [15]. Nevertheless, a general framework to relate the design of an imaging sensor to its usefulness for a given task is still missing.

In this work we will present such a framework by studying the relationship between the subset of light rays captured by a generalized camera and its performance with regard to the task of egomotion estimation. To analyze these general cameras, we will study the most complete visual representation of a scene, namely the plenoptic function as it changes differentially over time [1]. In free space the plenoptic function reduces to the five-dimensional space of time-varying light rays. Since any imaging device captures a subset of this space, the choice of the subset determines how well a task can be performed.



Fig. 1a–c. Design of a polydioptric camera (a) capturing parallel rays (b) and simultaneously capturing a pencil of rays (c)

We will use our study of the differential structure of the time-varying plenoptic function captured by a rigidly moving imaging sensor to analyze how this structure is related to the rigid motion of the sensor.

The idea of studying the space of light rays was already mentioned by Leonardo da Vinci [20] and further studied in the context of photometry at the beginning of the 20th century (for an overview see [14]). In computer graphics and computer vision, recent work uses non-perspective subsets of the plenoptic function to represent visual information to be used for image-based rendering. Some examples are light fields [13] and lumigraphs [9], multiple centers of projection images [19] that have been used in cel animation already for quite some time [23], and multi-perspective panoramas [18]. Non-perspective images have also been used by several people to reconstruct the observed scene from video sequences (for some examples see [3, 6, 22]).

In general terms, a camera is a mechanism that forms images by focusing light onto a light-sensitive surface (retina, film, CCD array, etc.). Different cameras are obtained by varying three elements: (1) the geometry of the surface, (2) the geometric distribution and optical properties of the photoreceptors, and (3) the way light is collected and projected onto the surface (single or multiple lenses, or tubes as in compound eyes).

A theoretical model for a camera that captures the plenoptic function in some part of the space is a surface S that has at every point a pinhole camera (for other possible parameterizations for generalized cameras see [10, 17, 21]). We call the actual implementation of this theoretical concept a polydioptric camera. A ‘plenoptic camera’ has been proposed in [2], but since no physical device can capture the true time-varying plenoptic function, we prefer the term polydioptric to emphasize the difference between the theoretical concept and the implementation. With a polydioptric camera we observe points in the scene in view from many different viewpoints (theoretically, from every point on S) and thus we capture many rays emanating from those points.

A polydioptric camera can be implemented by arranging ordinary cameras very close to each other (Fig. 1b and c). This camera has an additional property arising from the proximity of the individual cameras: it can form a very large number of orthographic images, in addition to the perspective ones. Indeed, consider a direction r in space and then consider in each individual camera the captured ray parallel to r. All these rays together, one from each camera, form an image with rays that are parallel. For different directions r, different orthographic images are formed. For example, Fig. 1b shows that we can select one appropriate pixel in each camera to form an orthographic image that looks to one side (blue rays pointing to the left) or another (red rays pointing to the right). Figure 1c shows all the captured rays, thus illustrating that each individual camera collects conventional pinhole images.

In conclusion, a polydioptric camera has the unique property that it captures, simultaneously, a large number of perspective and orthographic images (projections). We will demonstrate later that this capability enables us to solve the structure from motion problem using a linear, scene-independent equation.
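To make the two kinds of images concrete, here is a minimal numpy sketch (ours, not from the paper) that indexes a hypothetical polydioptric capture stored as a 4D array rays[cx, cy, px, py]: fixing the camera index gives a perspective image, while fixing the pixel index gives an orthographic image.

```python
import numpy as np

# Hypothetical capture from a planar array of pinhole cameras:
# rays[cx, cy, px, py] = intensity of the ray through pixel (px, py)
# of the camera at grid position (cx, cy). Values are synthetic here.
rng = np.random.default_rng(0)
rays = rng.random((8, 8, 64, 64))      # 8x8 cameras, 64x64 pixels each

# Perspective image: fix a viewpoint (one camera), vary the pixel,
# i.e. vary the ray direction through one optical center (Fig. 1c).
perspective_img = rays[3, 4, :, :]     # shape (64, 64)

# Orthographic image: fix the pixel (one direction), vary the camera,
# i.e. collect the parallel rays, one from each camera (Fig. 1b).
orthographic_img = rays[:, :, 10, 20]  # shape (8, 8)

print(perspective_img.shape, orthographic_img.shape)
```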


2 Plenoptic video geometry

Let the scene surrounding the image sensor be modeled by the signed distance function f(x; t) : R^3 × R^+ → R, where f(x; t) > 0 if the world point x is not occupied by a scene object, and f(x; t) ≤ 0 if x lies inside a scene object.

At each location x in free space (f(x; t) > 0), the radiance, that is the light intensity or color observed at x from a given direction r at time t, is measured by the plenoptic function L(x; r; t); L : R^3 × S^2 × R^+ → R^d, where d = 1 for intensity, d = 3 for color images, and S^2 is the unit sphere of directions in R^3.

The image irradiance that is recorded by an imaging device is proportional to the scene radiance [11]; therefore, we assume that the intensity recorded by an imaging sensor at position x and time t pointing in direction r is equal to the plenoptic function L(x; r; t).

Since a transparent medium such as air does not change the color of the light, the radiance along the view direction r is constant:

$$L(\mathbf{x}; \mathbf{r}; t) = L(\mathbf{x} + \lambda\mathbf{r};\ \mathbf{r};\ t) \quad \text{if } f(\mathbf{x} + \alpha\mathbf{r}) > 0\ \ \forall\, \alpha \in [0, \lambda],$$

which implies that

$$\nabla_{\mathbf{x}}L^{T}\mathbf{r} = \nabla_{\mathbf{r}}L^{T}\mathbf{r} = 0 \quad \forall\, \mathbf{x} \in \mathbb{R}^{3} \text{ with } f(\mathbf{x}; t) > 0,$$

where ∇_x L and ∇_r L are the partial derivatives of L with respect to x and r. Therefore, the plenoptic function in free space reduces to five dimensions – the time-varying space of directed lines, for which many representations have been presented (for an overview see Camahort and Fussell [4]).

Let us assume that the albedo of every scene point is constant over time and that we observe a static world under constant illumination. In this case, the radiance of a light ray does not change over time, which implies that the total time derivative of the plenoptic function vanishes:

$$\frac{d}{dt}L(\mathbf{x}; \mathbf{r}; t) = 0.$$

If at the intersection point y ∈ R^3 of the ray φ (φ(λ) = x + λr) with the scene surface (f(y) = 0) the albedo ρ(y; t) is continuously varying and no occlusion boundaries are present (r^T ∇f(y, t) ≠ 0), then we can develop the plenoptic function L in the neighborhood of (x; r; t) into a Taylor series (L_t is an abbreviation for ∂L/∂t):

$$L(\mathbf{x} + d\mathbf{x};\ \mathbf{r} + d\mathbf{r};\ t + dt) = L(\mathbf{x}; \mathbf{r}; t) + L_t\,dt + \nabla_{\mathbf{x}}L^{T}d\mathbf{x} + \nabla_{\mathbf{r}}L^{T}d\mathbf{r} + O(\|d\mathbf{r}, d\mathbf{x}, dt\|^{2}). \quad (1)$$

This expression now relates a local change in view ray position and direction to the first-order differential brightness structure of the plenoptic function. We define the plenoptic ray flow as the difference in ray position and orientation between the two rays that are captured by the same imaging element at two consecutive time instants. This allows us to use the spatio-temporal brightness derivatives of the light rays captured by an imaging device to constrain the plenoptic ray flow by generalizing the well-known image brightness constancy constraint to the plenoptic brightness constancy constraint:

$$\frac{d}{dt}L(\mathbf{x}; \mathbf{r}; t) = L_t + \nabla_{\mathbf{r}}L^{T}\frac{d\mathbf{r}}{dt} + \nabla_{\mathbf{x}}L^{T}\frac{d\mathbf{x}}{dt} = 0. \quad (2)$$

3 Plenoptic motion equations

The set of imaging elements that make up a camera each capture the radiance at a given position coming from a given direction. If the camera undergoes a rigid motion, then we can express this motion by a rigid coordinate transformation of the ambient space of light rays. This rigid transformation, parameterized by the rotation matrix R(t) and a translation vector q(t), results in the exact identity (see the example in Fig. 2)

$$L(\mathbf{x}; \mathbf{r}; 0) = L(R(t)\mathbf{x} + \mathbf{q}(t);\ R(t)\mathbf{r};\ t),$$

since the rigid motion maps the time-invariant space of light rays upon itself. Thus, the problem of estimating the rigid motion of a sensor has become an image-registration problem that is independent of the scene and only depends on the rigid-motion parameters!

We assume that the imaging sensor can capture images at a rate that allows us to use the small-angle approximation for the rotation matrix, R ≈ I + [ω]_×, where [ω]_× is a skew-symmetric matrix parameterized by the axis of the instantaneous rotation ω, to represent the motion of the camera between frames.


Fig. 2. a Sequence of images captured by a horizontally translating camera. b Epipolar-plane image volume formed by the image sequence, where the top half of the volume has been cut away to show how a row of the image changes when the camera translates. c A row of an image taken by a pinhole camera at two time instants (red and green) corresponds to two non-overlapping horizontal line segments in the epipolar-plane image, while in d the collection of corresponding ‘rows’ of a polydioptric camera at two time instants corresponds to two rectangular regions of the epipolar image that do overlap (yellow region). This overlap enables us to estimate the rigid motion of the camera purely based on the visual information recorded.

Now we can define the plenoptic ray flow for the ray captured by the imaging element located at position x and looking in direction r as

$$\frac{d\mathbf{r}}{dt} = \boldsymbol{\omega} \times \mathbf{r} \quad \text{and} \quad \frac{d\mathbf{x}}{dt} = \boldsymbol{\omega} \times \mathbf{x} + \mathbf{t}, \quad (3)$$

where t is the instantaneous translation (t = dq/dt). In stark contrast to the projection of the rigid-motion flow on the imaging surface of a conventional pinhole camera, which depends on an infinite number of parameters due to its dependence on the scene-depth structure, we see that the plenoptic ray flow is completely specified by the six rigid-motion parameters. This regular global structure of the rigid plenoptic ray flow makes the estimation of the rigid-motion parameters very well-posed. Combining Eqs. (2) and (3) leads to the plenoptic motion constraint

$$-L_t = \nabla_{\mathbf{x}}L \cdot (\boldsymbol{\omega} \times \mathbf{x} + \mathbf{t}) + \nabla_{\mathbf{r}}L \cdot (\boldsymbol{\omega} \times \mathbf{r}) = \nabla_{\mathbf{x}}L \cdot \mathbf{t} + (\mathbf{x} \times \nabla_{\mathbf{x}}L + \mathbf{r} \times \nabla_{\mathbf{r}}L) \cdot \boldsymbol{\omega}, \quad (4)$$

which is a linear, scene-independent constraint in the motion parameters and the plenoptic partial derivatives.

This formalism can also be applied if we observe a rigidly moving object with a set of static cameras. In this case, we attach the world coordinate system to the moving object and we can relate the relative motion of the image sensors with respect to the object to the spatio-temporal derivatives of the light rays that leave the object. It is to be noted, though, that we need to account for illumination effects due to the relative motion between object and light sources.

It is important to realize that the derivatives ∇_r L and ∇_x L can be obtained from the image information captured by a polydioptric camera. Recall that a polydioptric camera can be envisioned as a surface where every point corresponds to a pinhole camera. ∇_r L, the plenoptic derivative with respect to direction, is the derivative with respect to the image coordinates that one finds in a traditional pinhole camera. One keeps the position and time constant and changes direction (Fig. 1c). The second plenoptic derivative, ∇_x L, is obtained by keeping the direction of the ray constant and changing the position along the surface (Fig. 1b). Thus, one captures the change of intensity between parallel rays. This is similar to computing the derivatives in an orthographic camera. In Sect. 1 we mentioned that a polydioptric camera captures perspective and orthographic images. ∇_r L is found from the perspective images and ∇_x L from the orthographic images.

The ability to compute all the plenoptic derivatives depends on the ability to capture light at multiple viewpoints coming from multiple directions. This corresponds to the ability to incorporate stereo information into motion estimation, since multiple rays observe the same part of the world.
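As an illustration of Eqs. (3) and (4), the following numpy sketch (ours, with placeholder values for the plenoptic derivatives) evaluates the plenoptic ray flow of a single ray and the corresponding linear constraint row in the six rigid-motion parameters.

```python
import numpy as np

def plenoptic_ray_flow(x, r, t, omega):
    """Rigid plenoptic ray flow, Eq. (3): dr/dt = omega x r, dx/dt = omega x x + t."""
    return np.cross(omega, r), np.cross(omega, x) + t

def motion_constraint_row(x, r, grad_x_L, grad_r_L):
    """Coefficient row of Eq. (4): -L_t = grad_x_L . t + (x x grad_x_L + r x grad_r_L) . omega."""
    return np.concatenate([grad_x_L, np.cross(x, grad_x_L) + np.cross(r, grad_r_L)])

# Placeholder values for one imaging element and its plenoptic derivatives.
x = np.array([0.1, -0.2, 0.0])           # ray position
r = np.array([0.0, 0.0, 1.0])            # ray direction (unit norm)
grad_x_L = np.array([0.3, -0.1, 0.05])   # spatial derivative of L (assumed known)
grad_r_L = np.array([-0.2, 0.4, 0.0])    # directional derivative of L (assumed known)
t = np.array([0.01, 0.0, 0.02])          # instantaneous translation
omega = np.array([0.0, 0.005, 0.0])      # instantaneous rotation

dr_dt, dx_dt = plenoptic_ray_flow(x, r, t, omega)
row = motion_constraint_row(x, r, grad_x_L, grad_r_L)
# Eq. (4) predicts the temporal derivative L_t = -row . [t; omega]:
L_t_predicted = -row @ np.concatenate([t, omega])
print(dr_dt, dx_dt, L_t_predicted)
```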


Fig. 3a,b. Hierarchy of cameras (a static world – moving cameras; b moving objects – static cameras). We classify the different camera models according to the field of view (FOV) and the number and proximity of the different viewpoints that are captured (dioptric axis). This in turn determines if structure from motion estimation is a well-posed or an ill-posed problem, and if the motion parameters are related linearly or non-linearly to the image measurements. The camera models are, clockwise from the lower left: small FOV pinhole camera, spherical pinhole camera, spherical polydioptric camera, and small FOV polydioptric camera.

For single-viewpoint cameras this is inherently impossible, and thus it necessitates non-linear estimation over both structure and motion to compensate for this lack of multi-view (or equivalently depth) information.

To illustrate this idea further, we translate a camera along the horizontal image axis and stack the images to form an image volume (Fig. 2a and b). Due to the horizontal translation, scene points always project into the same row in each of the images. Such an image volume is known as an epipolar image volume [3] since corresponding rows all lie in the same epipolar plane. A horizontal slice through the image volume (Fig. 2c and d) is called an epipolar-plane image and illustrates the structure of the visual space in dependence on view position and direction. A row of an image taken by a pinhole camera corresponds to a horizontal line segment in the epipolar-plane image (Fig. 2c), while the collection of corresponding ‘rows’ of a polydioptric camera captures a rectangular area of the epipolar image (Fig. 2d). Here we assumed that the viewpoint axis of the polydioptric camera is aligned with the direction of translation used to define the epipolar image volume. If this is not the case, we can warp the images to align the optical centers.

We see that a camera rotation around an axis perpendicular to the epipolar-image plane corresponds to a horizontal shift in the epipolar image, while a translation of the camera parallel to an image row causes a vertical shift. These shifts can be different for each pixel depending on the rigid motion of the camera. If we want to recover this rigid transformation based on the images captured, we see in Fig. 2c that, when using a pinhole camera, we have to match two non-overlapping subsets of the visual space (shown in green and red). At each time a pinhole camera captures by definition only the view from a single viewpoint. Therefore, it is necessary for an accurate recovery of the rigid motion that we estimate the scene structure, since the correspondence between pixels in image rows taken from different viewpoints depends on the depth of the scene. In contrast, we see in Fig. 2d that for a polydioptric camera the matching can be based purely on the captured image information, as we have a ‘true’ brightness constancy in the region of overlap (yellow), since we match a light ray with itself. In this case the correspondence of pixels (light rays) depends only on the motion of the camera, not on the depth of the scene, thus enabling us to estimate the rigid motion of the camera in a completely scene-independent manner.


In previous work [16], we developed a hierarchy of camera designs with regard to the linearity of the estimation and the stability due to the field of view (see Fig. 3). We see in the figure that although the estimation of structure and motion for a single-viewpoint spherical camera is stable and robust, it is still non-linear, and the algorithms that give the most accurate results are search techniques, and thus rather elaborate. A spherical polydioptric camera combines the stability of full field of view motion estimation with the linearity and scene independence of the problem, and is therefore the camera of choice to solve the structure from motion problem.

4 Light field video geometry

Due to the difficulties involved when using signal-processing operators in a spherical coordinate system, we will choose the two-plane parameterization that was used in [9, 13] to represent the space of light rays. All the lines passing through some space of interest can be parameterized by surrounding this space (that could contain either a camera or an object) with two nested cubes and then recording the intersection of the light rays entering the camera or leaving the object with the planar faces of the two cubes. We only describe the parameterization of the rays passing through one pair of faces; the extension to the other pairs is straightforward. Without loss of generality we choose both planes to be perpendicular to the z axis and separated by a distance of f. We denote one plane as the focal plane Π_f, indexed by coordinates (x, y), and the other plane as the image plane Π_i, indexed by (u, v), where (u, v) is defined in a local coordinate system with respect to (x, y) (see Fig. 4a). Both (x, y) and (u, v) are aligned with the (X, Y) axes of the world coordinates, and Π_f is at a distance of Z_Π from the origin of the world coordinate system.

This enables us now to parameterize the light rays that pass through both planes at any time t using the tuples (x, y, u, v, t), and we can record their intensity in the time-varying light field L(x, y, u, v, t). For a fixed location (x, y) in the focal plane, L(x, y, ·, ·, t) corresponds to the image captured by a perspective camera. If instead we fix the view direction (u, v), we capture an orthographic image L(·, ·, u, v, t) of the scene (· denotes the varying parameters).
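A minimal sketch of this parameterization under the stated conventions (focal plane Π_f at depth Z_Π, image plane at Z_Π + f, and (u, v) measured relative to (x, y)); the function name and numerical values are ours:

```python
import numpy as np

def lightfield_ray(x, y, u, v, f=1.0, z_pi=0.0):
    """Map a light-field index (x, y, u, v) to a ray (point on the focal
    plane and unit direction) in world coordinates, two-plane setup."""
    p_focal = np.array([x, y, z_pi])                 # intersection with the focal plane
    p_image = np.array([x + u, y + v, z_pi + f])     # intersection with the image plane
    d = p_image - p_focal                            # = [u, v, f]
    return p_focal, d / np.linalg.norm(d)

# For fixed (x, y) and varying (u, v) the rays form a pencil (a perspective
# image); for fixed (u, v) and varying (x, y) they are parallel (orthographic).
print(lightfield_ray(0.2, -0.1, 0.05, 0.0, f=0.5))
```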

Using this light field parameterization we can rewrite the plenoptic motion equation (Eq. (4)) by setting x = [x, y, Z_Π]^T and r = [u, v, f]^T / ‖[u, v, f]^T‖. To be able to use this equation we need to convert the spatial partial derivatives of the light field, L_x = ∂L/∂x, ..., L_v = ∂L/∂v, into the three-dimensional plenoptic derivatives ∇_x L and ∇_r L. We consider the projections of ∇_x L and ∇_r L onto the three directions c_x, c_y, and r to obtain the following linear system, which we solve for ∇_x L and ∇_r L (c_x = [1, 0, 0]^T, c_y = [0, 1, 0]^T, and n_r = ‖[u, v, f]^T‖):

$$\begin{bmatrix} \mathbf{c}_x^{T} \\ \mathbf{c}_y^{T} \\ \mathbf{r}^{T} \end{bmatrix} \left[\nabla_{\mathbf{x}}L,\ \nabla_{\mathbf{r}}L\right] = \begin{bmatrix} L_x & n_r L_u \\ L_y & n_r L_v \\ 0 & 0 \end{bmatrix}.$$

This results in the following expressions for the plenoptic derivatives:

$$\nabla_{\mathbf{x}}L = \left[L_x,\ L_y,\ -\frac{u}{f}L_x - \frac{v}{f}L_y\right]^{T}, \qquad \nabla_{\mathbf{r}}L = n_r \left[L_u,\ L_v,\ -\frac{u}{f}L_u - \frac{v}{f}L_v\right]^{T}.$$
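In code, the conversion from light-field partials to plenoptic derivatives is a direct transcription of the two expressions above (variable names are ours):

```python
import numpy as np

def plenoptic_gradients(Lx, Ly, Lu, Lv, u, v, f):
    """Convert light-field partials (Lx, Ly, Lu, Lv) at ray index (u, v)
    into the plenoptic derivatives grad_x L and grad_r L."""
    n_r = np.linalg.norm([u, v, f])
    grad_x = np.array([Lx, Ly, -(u / f) * Lx - (v / f) * Ly])
    grad_r = n_r * np.array([Lu, Lv, -(u / f) * Lu - (v / f) * Lv])
    return grad_x, grad_r

print(plenoptic_gradients(Lx=0.3, Ly=-0.1, Lu=0.12, Lv=0.04, u=0.2, v=-0.1, f=0.5))
```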

If we now substitute these expressions into Eq. (4), we can define the plenoptic motion constraint for the ray indexed by (x, y, u, v, t) as ([·; ·] denotes the vertical stacking of vectors):

$$-L_t = \nabla_{\mathbf{x}}L \cdot \mathbf{t} + (\mathbf{x} \times \nabla_{\mathbf{x}}L + \mathbf{r} \times \nabla_{\mathbf{r}}L) \cdot \boldsymbol{\omega}$$
$$= \left[L_x,\ L_y,\ -\tfrac{u}{f}L_x - \tfrac{v}{f}L_y\right]\mathbf{t} - \left[L_x,\ L_y,\ -\tfrac{u}{f}L_x - \tfrac{v}{f}L_y\right]\left([x, y, Z_\Pi]^{T} \times \boldsymbol{\omega}\right) - \left[L_u,\ L_v,\ -\tfrac{u}{f}L_u - \tfrac{v}{f}L_v\right]\left([u, v, f]^{T} \times \boldsymbol{\omega}\right)$$
$$= [L_x,\ L_y,\ L_u,\ L_v]\,[M_t,\ M_\omega]\,[\mathbf{t};\ \boldsymbol{\omega}], \quad (5)$$

where

$$M_t = \begin{bmatrix} 1 & 0 & -\frac{u}{f} \\ 0 & 1 & -\frac{v}{f} \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},$$


$$M_\omega = \begin{bmatrix} -\frac{uy}{f} & \frac{ux}{f} + Z_\Pi & -y \\ -\left(\frac{vy}{f} + Z_\Pi\right) & \frac{vx}{f} & x \\ -\frac{uv}{f} & \frac{u^{2}}{f} + f & -v \\ -\left(\frac{v^{2}}{f} + f\right) & \frac{vu}{f} & u \end{bmatrix}.$$

Fig. 4. a Light field parameterization (here shown only for the light field slice spanned by axes x and u). b The ratio between the magnitude of perspective and orthographic derivatives is linear in the depth of the scene. c The scale at which the perspective and orthographic derivatives match depends on the depth (matching scale in red)

By combining the constraints across the light field, we can form a highly over-determined linear system and solve for the rigid-motion parameters.

As said before, the light field derivatives L_x, ..., L_t can be obtained directly from the image information captured by a polydioptric camera. To convert the image information captured by this collection of pinhole cameras into a light field, for each camera we simply have to intersect the rays from its optical center through each pixel with the two planes Π_f and Π_i and set the corresponding light field value to the pixel intensity. Since our measurements are in general scattered, we have to use appropriate interpolation schemes to compute a continuous light field function (see, for example, the push–pull scheme in [9]). The light field derivatives can then easily be computed by applying standard image-derivative operators to the continuous light field. The plenoptic motion constraint is extended to the other faces of the nested cube by premultiplying t and ω with the appropriate rotation matrices to rotate the motion vectors into local light field coordinates.
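A minimal numpy sketch of this step, assuming the light-field derivatives L_x, L_y, L_u, L_v, L_t are already available at sampled rays (the samples below are random placeholders, so the recovered motion is meaningless; the point is the structure of the system built from Eq. (5) and the matrices above):

```python
import numpy as np

def constraint_row(x, y, u, v, f, z_pi, Lx, Ly, Lu, Lv):
    """One row of the linear system of Eq. (5): [Lx, Ly, Lu, Lv] [Mt, Mw]."""
    Mt = np.array([[1, 0, -u / f],
                   [0, 1, -v / f],
                   [0, 0, 0],
                   [0, 0, 0]])
    Mw = np.array([[-u * y / f, u * x / f + z_pi, -y],
                   [-(v * y / f + z_pi), v * x / f, x],
                   [-u * v / f, u ** 2 / f + f, -v],
                   [-(v ** 2 / f + f), v * u / f, u]])
    L = np.array([Lx, Ly, Lu, Lv])
    return L @ np.hstack([Mt, Mw])            # shape (6,)

# Stack constraints over sampled rays and solve for [t; omega] by least squares.
f, z_pi = 0.5, 1.0
rng = np.random.default_rng(1)
rows, rhs = [], []
for _ in range(500):
    x, y, u, v = rng.uniform(-0.5, 0.5, size=4)
    Lx, Ly, Lu, Lv, Lt = rng.normal(size=5)   # placeholder derivative samples
    rows.append(constraint_row(x, y, u, v, f, z_pi, Lx, Ly, Lu, Lv))
    rhs.append(-Lt)
motion, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
t_est, omega_est = motion[:3], motion[3:]
print(t_est, omega_est)
```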

The relationship between the structure from motion formulation for conventional single-viewpoint cameras and the formulation for polydioptric cameras can easily be established. If we consider a two-dimensional slice of the light field corresponding to the indices (x, u) that form an epipolar image [3] and assume that the surface has slowly varying reflectance, we can apply a simple triangulation argument to get the following identity (du is the change in view direction and dx the corresponding change in view position, as illustrated in Fig. 4b):

$$L(x, y, u + du, v + dv, t) = L\!\left(x, y, u + \frac{f}{Z}dx,\ v + \frac{f}{Z}dy,\ t\right) = L(x + dx, y + dy, u, v, t). \quad (6)$$

If we now compute the directional derivative of the light field with respect to the u axis (the derivation with respect to the v axis is analogous), we get


$$L_u = \frac{dL}{du} = \lim_{\Delta u \to 0} \frac{L(x, u + \Delta u, t) - L(x, u, t)}{\Delta u} = \lim_{\Delta u \to 0} \frac{L\!\left(x + \frac{Z}{f}\Delta u,\ u,\ t\right) - L(x, u, t)}{\Delta u} = \lim_{\Delta u \to 0} \frac{L(x + \Delta u, u, t) - L(x, u, t)}{\frac{f}{Z}\Delta u} = \frac{Z}{f}\frac{dL}{dx} = \frac{Z}{f}L_x,$$

and we see that depth is encoded as the ratio between the positional and directional derivatives, where Z denotes the depth of the scene point imaged by the ray with index tuple (x, y, u, v).

From Eq. (6) we can now conclude that we can replace any flow in the view-direction variables (u, v) that is inversely proportional to the depth of the scene, e.g. the projection of the translational flow, by flow in the viewpoints (x, y) that is independent of the scene. This enables us to formulate the structure from motion problem, independent of the scene in view, as a linear problem. This is identical to the relation between depth and differential image measurements used in differential stereo and in epipolar-plane image analysis [3]. Thus, we can interpret plenoptic motion estimation as the integration of differential motion estimation with differential stereo.
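A small self-contained check of this relation on a synthetic epipolar-plane slice (our own construction, not the experiment of Sect. 5): for a fronto-parallel textured plane at depth Z, the slice is L(x, u) = T(x + (Z/f)u), and finite differences recover L_u ≈ (Z/f) L_x.

```python
import numpy as np

# Synthetic 2D light-field slice L(x, u) of a fronto-parallel textured plane
# at depth Z in front of the focal plane: the ray starting at (x, 0) with
# direction (u, f) hits the plane at lateral coordinate x + (Z / f) * u,
# so L(x, u) = T(x + (Z / f) * u) for a Lambertian texture T.
f, Z = 0.5, 4.0
texture = lambda s: np.sin(3.0 * s) + 0.3 * np.cos(7.0 * s)   # smooth albedo

xs = np.linspace(-0.5, 0.5, 201)
us = np.linspace(-0.1, 0.1, 201)
X, U = np.meshgrid(xs, us, indexing="ij")
L = texture(X + (Z / f) * U)

# Finite-difference estimates of the positional and directional derivatives.
Lx = np.gradient(L, xs, axis=0)
Lu = np.gradient(L, us, axis=1)

# The derivation above predicts Lu = (Z / f) * Lx: the ratio encodes depth.
mask = np.abs(Lx) > 0.1
print(np.median(Lu[mask] / Lx[mask]), Z / f)   # both close to 8.0
```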

4.1 Scale dependence of the plenoptic derivatives

One issue we will not address in this paper, but that needs to be considered, is how densely the plenoptic space needs to be sampled by a polydioptric camera to be able to compute accurate structure and motion information. A similar problem has been studied in computer graphics under the name plenoptic sampling [7], with the aim of improving view interpolation in image-based rendering. The sampling density necessary to avoid aliasing depends of course on the brightness profile and depth structure of the scene; thus the choice of the correct smoothing filter depends very much on prior knowledge. The question of how to design this optimal smoothing filter for motion estimation is beyond the scope of this paper.

The accuracy of the linearization of the time-varying light field (Eq. (1)) also depends on the compatibility of the plenoptic derivatives. This means that the estimation of the perspective, orthographic, and temporal plenoptic derivatives needs to be based upon similar subsets of the scene radiance. Notice that if we combine information from neighboring measurements in directional space at a fixed position, we integrate radiance information over a region of the scene surface whose area scales with the distance from the sensor to the scene. In contrast, if we combine information over neighboring measurements in positional space for a fixed direction, we integrate information over a region of the scene surface whose area is independent of the depth (illustrated in Fig. 4c). Unless the brightness structure of the scene has enough similarity across scales (e.g. if the local scene radiance changes linearly on the scene surface), so that the derivative computation is invariant to our choice of filter size, we have to make sure when we compute the plenoptic derivatives with respect to time, direction, and position that the domains of integration of our derivative filters relative to the scene are as similar as possible. One way to adjust the integration domains of the filters would be to compute the temporal, directional, and positional derivatives at many scales and use Eq. (7) as a constraint to find the best relative shift in scale space between the positional, directional, and temporal derivatives.
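One possible reading of this scale-adjustment idea, sketched on the synthetic slice from the previous example (the scale-selection criterion below, choosing the directional scale that makes the ratio L_u/L_x most consistent, is our own illustration rather than the paper's procedure):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Same synthetic light-field slice as in the previous sketch.
f, Z = 0.5, 4.0
texture = lambda s: np.sin(3.0 * s) + 0.3 * np.cos(7.0 * s)
xs = np.linspace(-0.5, 0.5, 201); dx = xs[1] - xs[0]
us = np.linspace(-0.1, 0.1, 201); du = us[1] - us[0]
X, U = np.meshgrid(xs, us, indexing="ij")
L = texture(X + (Z / f) * U)

def deriv(img, axis, sigma, spacing):
    """Gaussian-smoothed first derivative along one light-field axis."""
    return gaussian_filter1d(img, sigma=sigma, axis=axis, order=1) / spacing

# Positional derivative at a fixed filter scale.
Lx = deriv(L, axis=0, sigma=2.0, spacing=dx)

# Try several directional scales and keep the one whose depth ratio Lu / Lx
# is most consistent over the slice (smallest spread), i.e. whose integration
# domain on the scene surface best matches that of the positional filter.
best = None
for sigma_u in (1.0, 2.0, 4.0, 8.0, 16.0):
    Lu = deriv(L, axis=1, sigma=sigma_u, spacing=du)
    mask = np.abs(Lx) > 0.1
    spread = np.std(Lu[mask] / Lx[mask])
    if best is None or spread < best[1]:
        best = (sigma_u, spread)
print("most consistent directional scale:", best[0])
```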

5 Experiments

To examine the performance of an algorithm using the plenoptic motion constraint, we did experiments with synthetic data. We distributed spheres, textured with a smoothly varying pattern, randomly in the scene so that they filled the horizon of the camera (see Fig. 5a). We then computed the light fields for all the faces of the nested cube surrounding the camera through ray tracing, computed the derivatives, stacked the linear equations (Eq. (5)) to form a linear system, and solved for the motion parameters. Even using derivatives only at one scale, we found that the motion is recovered very accurately, as seen in Fig. 5c. As long as the relative scales of the derivatives were similar enough (scene not too far away), the error in the motion parameters varied between 1% and 3%. This accurate egomotion estimate could then be used to compute depth from differential measurements using the following four formulas (M_x^i denotes the ith row of the coefficient matrix M_t or M_ω in Eq. (5)):


$$z = \frac{L_u}{f L_x} = \frac{L_v}{f L_y} \quad (7)$$
$$= -\frac{\left[L_u M_t^{1} + L_v M_t^{2}\right]\mathbf{t} + \left[L_u M_\omega^{1} + L_v M_\omega^{2}\right]\boldsymbol{\omega}}{f\left(L_t + \left[L_u M_\omega^{3} + L_v M_\omega^{4}\right]\boldsymbol{\omega}\right)}$$
$$= -\frac{L_t + \left[L_x M_t^{1} + L_y M_t^{2}\right]\mathbf{t} + \left[L_x M_\omega^{1} + L_y M_\omega^{2}\right]\boldsymbol{\omega}}{f\left[L_x M_\omega^{3} + L_y M_\omega^{4}\right]\boldsymbol{\omega}},$$

where the first two equations describe the well-known constraints of differential stereo, while the last two equations have been used in the community to compute depth maps based on structure from motion constraints.

Fig. 5. a Subset of an example scene, b the corresponding light field, c accuracy of plenoptic motion estimation. The plot shows the ratio of the true and estimated motion parameters (vertical axis) in dependence on the distance between the sensor surface and the scene (horizontal axis) for f = 60 and spheres of unit radius

The analysis suggests the following plenoptic structure from motion algorithm. Using the proposed plenoptic motion framework, one can envision a feedback-loop algorithm, where we use all the plenoptic derivatives to compute an estimate of the camera motion using Eq. (4). Since we are solving a linear system, the computation of the motion parameters is fast and we do not have any convergence issues as in the non-linear methods necessary for single-viewpoint cameras. Then we can use the recovered motion together with the plenoptic derivatives to compute a scene-depth estimate. If the four estimates in Eq. (7) do not match, we adapt the integration domains of the temporal, directional, and positional derivative filters until we compute consistent depth and motion estimates. This is repeated for each frame of the input video, while simultaneously we use the computed motion trajectory to integrate and refine the instantaneous depth maps in a large-baseline stereo optimization to construct accurate three-dimensional descriptions or image-based representations (e.g. [12]) of the scene.

6 Conclusion

According to ancient Greek mythology, Argus, the hundred-eyed guardian of Hera, the goddess of Olympus, alone defeated a whole army of Cyclopes, one-eyed giants. The mythological power of many eyes became real in this paper, which proposed a mathematical analysis of the differential structure of the space of light rays. We introduced a framework to systematically study the relationship between the shape of an imaging sensor and the task performance of the entity using this sensor. In this paper we focused on the relation between the local differential structure of the time-varying plenoptic function and the egomotion estimation of an imaging sensor. Of course many more tasks are imaginable (for example, shape and three-dimensional motion estimation from non-perspective imagery). The application of this framework to the structure from motion problem resulted in a novel linear and scene-independent constraint between the differential structure of the plenoptic function and the motion parameters describing the instantaneous motion of the sensor, which was used to define guidelines for an optimal structure from motion camera design.


References

1. Adelson EH, Bergen JR (1991) The plenoptic function and the elements of early vision. In: Landy M, Movshon JA (eds) Computational models of visual processing. MIT Press, Cambridge, Mass, pp 3–20
2. Adelson EH, Wang JYA (1992) Single lens stereo with a plenoptic camera. IEEE Trans PAMI 14:99–106
3. Bolles RC, Baker HH, Marimont DH (1987) Epipolar-plane image analysis: an approach to determining structure from motion. Int J Comput Vision 1:7–55
4. Camahort E, Fussell D (1999) A geometric study of light field representations. Technical Report TR99-35, Department of Computer Sciences, The University of Texas at Austin
5. Capurro C, Panerai F, Sandini G (1996) Vergence and tracking fusing log-polar images. In: Proceedings International Conference on Pattern Recognition
6. Chai J, Shum H (2000) Parallel projections for stereo reconstruction. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition, Vol 2, pp 493–500
7. Chai J, Tong X, Shum H (2000) Plenoptic sampling. In: Proceedings ACM SIGGRAPH, pp 307–318
8. Dawkins R (1996) Climbing Mount Improbable. Norton, New York
9. Gortler S, Grzeszczuk R, Szeliski R, Cohen M (1996) The lumigraph. In: Proceedings ACM SIGGRAPH, pp 43–54
10. Grossberg MD, Nayar SK (2001) A general imaging model and a method for finding its parameters. In: Proceedings International Conference on Computer Vision, pp 108–115
11. Horn BKP (1986) Robot vision. McGraw Hill, New York
12. Koch R, Pollefeys M, Heigl B, Van Gool L, Niemann H (1999) Calibration of hand-held camera sequences for plenoptic modeling. In: Proceedings International Conference on Computer Vision, pp 585–591
13. Levoy M, Hanrahan P (1996) Light field rendering. In: Proceedings ACM SIGGRAPH, pp 161–170
14. Moon P, Spencer DE (1981) The photic field. MIT Press, Cambridge, Mass
15. Nayar S (1997) Catadioptric omnidirectional camera. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp 482–488
16. Neumann J, Fermüller C, Aloimonos Y (2002) Eyes from eyes: new cameras for structure from motion. In: IEEE Workshop on Omnidirectional Vision 2002 (in conjunction with ECCV 2002), pp 19–26
17. Pajdla T (2002) Stereo with oblique cameras. Int J Comput Vision 47(1/2/3):161–170
18. Peleg S, Herman J (1997) Panoramic mosaics by manifold projection. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp 338–343
19. Rademacher P, Bishop G (1998) Multiple-center-of-projection images. In: Proceedings ACM SIGGRAPH, pp 199–206
20. Richter JP (ed) (1970) The notebooks of Leonardo da Vinci, Vol 1. Dover, New York, p 39
21. Seitz S (2001) The space of all stereo images. In: Proceedings International Conference on Computer Vision
22. Shum HY, Kalai A, Seitz SM (1999) Omnivergent stereo. In: Proceedings International Conference on Computer Vision
23. Wood DN, Finkelstein A, Hughes JF, Thayer CE, Salesin DH (1997) Multiperspective panoramas for cel animation. In: Proceedings ACM SIGGRAPH, pp 243–250

JAN NEUMANN did his undergraduate studies at the Medical University of Lübeck in Medical Computer Science for three years, before he became a PhD student in Computer Science at the University of Maryland in 1997. There he received an MSc in Computer Science in 2001 and is currently pursuing his PhD. His main research interests include the representation of the visual information captured by many imaging devices simultaneously and the representation and extraction of the spatio-temporal structure of the world that is observed.

CORNELIA FERMÜLLER received the MS degree in Applied Mathematics from the University of Technology, Graz, Austria in 1989 and the PhD degree in Computer Science from the Technical University of Vienna, Austria in 1993. In 1994 she held visiting positions at KTH Stockholm, Sweden and FORTH, Crete, Greece. Since 1994, she has been a Research Scientist with the Computer Vision Laboratory of the Institute for Advanced Computer Studies, University of Maryland. Her research interests include computations involved in the interpretation of the geometry of the world from multiple views or video, the study of mathematical constraints relating images to the visible world, and the development of algorithms for robotic systems. She also pursues the explanation of biological vision processes using her theoretical findings. On these subjects she has authored numerous papers in international journals and books.