
Department of Informatics

Timo Horstschäfer

Parallel Tracking, Depth Estimation, and Image Reconstruction with an Event Camera

[Cover diagram: the Tracking, Depth Estimation, and Image Reconstruction modules exchanging poses, depth, and the reconstructed image.]

Master Thesis
Robotics and Perception Group, University of Zurich

Supervision:
Dr. Guillermo Gallego
Elias Mueggler
Prof. Dr. Tobi Delbrück
Prof. Dr. Davide Scaramuzza

May 2016


Contents

Abstract
Nomenclature
1 Introduction
  1.1 Event Cameras
  1.2 Visual odometry
  1.3 Related Work
  1.4 Contribution
2 Event-Based Vision
  2.1 Event Frames
  2.2 SLAM for Event-Based Vision
    2.2.1 Simultaneous Depth Estimation and Image Reconstruction
3 Image Stabilization
  3.1 IMU Complementary Filter
    3.1.1 Prediction
    3.1.2 Correction
    3.1.3 Online Bias Estimation
  3.2 Image Stabilization
  3.3 Event Stabilization
  3.4 Conclusion
    3.4.1 The Rotation Camera
4 Tracking
  4.1 Inverse Compositional Lucas-Kanade Method
  4.2 3D Rigid-Body Warp
  4.3 Random Sparse Tracking
  4.4 Map Extraction
  4.5 Reference Frame
  4.6 Event Frame
5 Depth Estimation
  5.1 Epipolar Line Search
  5.2 Linear Triangulation
  5.3 Bayesian Depth Filter
  5.4 Event Frame
  5.5 Keypoints
6 Image Reconstruction
  6.1 Pixel-wise Extended Kalman Filter
  6.2 Poisson Image Reconstruction
  6.3 Depthmap Propagation
7 Parallel Tracking, Depth Estimation, and Image Reconstruction
8 Experiments
  8.1 Tracking
    8.1.1 Precision
    8.1.2 Performance
    8.1.3 Real World Setting
  8.2 Depth Estimation
  8.3 Reconstruction
  8.4 Visual Odometry
9 Discussion
  9.1 Tracking
    9.1.1 Event Frame
    9.1.2 IMU integration
  9.2 Depth Estimation
  9.3 Image Reconstruction
A Appendix
  A.1 Sub-Pixel Bilinear Interpolation
  A.2 Sub-Pixel Bilinear Drawing


Abstract

Visual odometry, i.e. estimating a robot's trajectory from visual sensing, has practically been solved for standard (e.g., frame-based) cameras. However, by nature it is unable to overcome the physical limitations of standard cameras, most importantly the slow reaction time, low dynamic range, and motion blur.

Event cameras, in contrast, work by an entirely different principle: instead of synchronously measuring a complete image at a fixed frame rate, event cameras output pixel-level brightness changes at the time they occur. The advantages are fast reaction time, high dynamic range, and no motion blur. Visual odometry with event cameras, however, requires completely new algorithms, since the output of these sensors is completely different from that of standard cameras.

This thesis proposes techniques for image and event stabilization using an accelerometer and a gyroscope, as well as a set of novel, purely event-based methods for tracking, depth estimation and image reconstruction, which together form a complete visual odometry pipeline. In contrast to previous work, the proposed methods work in three-dimensional environments with no constraints on the photometric content and can handle camera motions in all six degrees of freedom. We test our algorithms on both synthetic and real datasets and demonstrate the high-quality results of our methods.


Nomenclature

Acronyms and Abbreviations

AER Address-Event Representation

DAVIS Dynamic and Active-pixel Vision Sensor

DoF Degree of Freedom

DVS Dynamic Vision Sensor

EKF Extended Kalman Filter

HDR High Dynamic Range

IMU Inertial Measurement Unit

ROS Robot Operating System

RPG Robotics and Perception Group

SLAM Simultaneous Localization and Mapping

VO Visual Odometry


Chapter 1

Introduction

Today, many robots make use of visual sensors to localize themselves in the environment. This works without any additional sensing, such as a GPS signal, and makes the robots truly autonomous. Visual navigation is, however, limited by the physical capabilities of standard cameras. They cannot handle high-speed motion and do not work in environments that exhibit strong illumination changes. Unfortunately, these are exactly the conditions in which we want our flying robots and autonomous machines to function.

Event cameras, in contrast, work by a new concept, which outperforms standard cameras in many ways but does not allow us to use the same algorithms developed for standard vision. It is thus desirable to find new ways to bring common techniques of machine vision to this new type of camera.

We explain the method of operation of event cameras in the next section. Next is an introduction to visual odometry. We give a short overview of the work that has already been done in this field before we finally summarize the contributions of this thesis.

1.1 Event Cameras

Traditional cameras acquire the visual information of a scene as a stream of intensity images at a constant rate. In contrast, event cameras, also called Dynamic Vision Sensors (DVS) [1], only report pixel-level brightness changes (called "events") asynchronously, at the time they occur (with microsecond resolution); thus there is no notion of images or frames, since each pixel operates independently. Consider a fixed camera viewing a static scene. A traditional camera will output the same amount of data in every image, even if there is no motion or illumination change in the scene. By contrast, an event camera will produce no output when viewing the same, static scene. However, as soon as there is a small change in the scene, the event-based camera will report exactly that change. For standard cameras, further processing of the images is needed in order to detect such scene changes.


This natural "compression" of information, or redundancy suppression, allows event cameras to better sample the visual content of the scene, thus providing more information than standard cameras, since the latter do not provide information in the blind time between frames. While a standard camera delivers frames at a rate of about 30 Hz, typical event cameras can output more than 1,000,000 events per second, with microsecond latency. This high rate of asynchronous information allows us to treat the event stream as continuous.

Each event consists of three pieces of information: the pixel location at which the brightness change occurred, its time stamp, and its polarity (i.e., sign). An event is caused by a fixed amount of brightness change at a pixel, and the polarity defines whether the brightness change was positive or negative, i.e., whether the pixel got brighter or darker, respectively. These three pieces of information are also called the Address-Event Representation (AER) [1].

This principle of operation of the event camera is shown in Fig. 1.1a, where a rotating dot causes a continuous stream of events, as well as some background noise.

Figure 1.1: (a) The principle of operation of event cameras. A rotating dot causes a continuous stream of events. (b) Demonstration of the high dynamic range up to 130 dB. Left: Visualization of the events. Right: Classical camera images. (Images courtesy of [1])

Another consequence of the event-based operating paradigm is that it allows for a much higher dynamic range than that of standard cameras, since a global exposure time is not imposed on every pixel; instead, each pixel works independently, self-adjusting to the local brightness level of the scene. For example, the DAVIS [2] contains both an event sensor and a frame sensor on the same pixel array. While the images have a dynamic range of 51 dB, the event stream shows a range of 130 dB. Thus, the dynamic range of the events outperforms the images by a factor of 100,000,000. This is illustrated in Fig. 1.1b.
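As a quick check of the quoted factor (assuming the dB values follow the 10·log10 power convention used in the sensor specifications): 130 dB − 51 dB = 79 dB, and 10^(79/10) ≈ 7.9 × 10^7, i.e. roughly 10^8.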

These properties make event cameras interesting for applications in robotics, where short reaction times are needed and robots have to operate in environments with large illumination variations.


1.2 Visual odometry

The goal of this thesis is to develop algorithms for visual odometry using event cameras. Visual odometry is the task of estimating the pose of a robot based on the visual input of a camera. The term originates from so-called wheel odometry, a precursor to visual odometry in which the amount of rotation of a wheel is measured in order to compute its position. The input gathered from visual sensors is much more abundant than that from wheels and thus allows for more precise and robust tracking. However, the task of using vision for odometry is not trivial and has only been solved for standard cameras in the last decade [3].

Event cameras, however, pose a difficult challenge for visual odometry, since state-of-the-art algorithms are designed to work on intensity images, which are not directly available from the event stream. Therefore, new methods have to be developed that work on the event stream and exploit the advantages of these novel vision sensors.

In this thesis, we divide the problem of visual odometry into three distinct sub-tasks. The first one is tracking (or localization) of the event camera with respect to a given three-dimensional map. The second is estimating the depth of the scene given the trajectory of the camera. The last is reconstructing the actual intensity image given the camera trajectory and the depth of the scene. This last step is considered an important part of visual odometry with event cameras, in contrast to standard cameras: recent proposals for pose estimation with event cameras all rely on additional intensity image information in order to extract usable maps for the tracking process.

1.3 Related Work

Some progress towards visual odometry with event cameras has already been achieved. However, all existing approaches use additional sensing or work only in constrained environments (constrained camera motion and/or constrained scene).

The first results on the estimation of image intensities, optical flow and rotational motion from an event camera were shown by Cook et al. [4]. Instead of engineering a classical visual pipeline, they solved the joint optimization problem by designing a network of interconnected maps that mimics how the brains of humans and animals process visual stimuli. Their work does not estimate camera translation or the depth of the scene; it has not been extended to the six degrees-of-freedom (6 DoF) case yet.

First results on Simultaneous Localization and Mapping (SLAM) in only 2D were reported by Weikersdorfer et al. [5]. Limiting both the localization and the mapping to two dimensions, however, skips the challenging task of depth estimation using a single event camera. Their work was extended in [6] to work in 6 DoF. However, in that work the event camera is coupled with a depth sensor. As the depth sensor works similarly to a standard camera, their approach suffers from the same problems and is unable to exploit the advantages of event cameras such as low latency and high dynamic range.

Mueggler et al. [7] presented a method that can track all 6 DoF of an event camera mounted on a quadrotor, with no additional sensing. However, the map was given by a black square on a white wall, and the tracking algorithm was especially tailored


for that particular type of map. The work was extended in [8] to fit a continuous camera trajectory (using cubic splines) to the measured events, taking into account their asynchronous nature. This resulted in a smoother camera trajectory.

Localization with an event camera for short time periods (e.g., the time between camera frames) was proposed by Censi et al. [9]. The approach relies on additional intensity information from a standard camera and thus cannot fully take advantage of the properties of event cameras. Although they report good results for rotation estimation, the translation of the camera is not estimated very accurately.

A novel method based on the so-called generative event model has been applied in [10] to estimate the pose for arbitrary 6-DoF motions with respect to a given intensity image and a three-dimensional map. They propose an implicit Extended Kalman Filter (EKF) to update the camera state for each event in a Bayesian inference framework.

A compelling demonstration of the capabilities of event cameras in acquiring visual information was given by Kim et al. [11], where the camera rotation and the image gradients were estimated simultaneously. The image gradients are used to reconstruct the original image intensities by solving the Poisson equation. The method recovers a mosaic of the scene by combining the estimated image intensities according to the rotating camera motion. The restriction to rotation only, however, avoids estimating the depth of the scene and the translational camera motion, which greatly simplifies the process.

Another novel approach for image reconstruction from event cameras has been proposed by Bardow et al. [12]. They formulate a joint error function on the image gradients and the optical flow. A preconditioned primal-dual formulation is used to minimize the error. The result is a first "video-like" sequence generated using only the event stream and no additional sensing. Their approach, however, does not estimate camera motion or scene depth.

Event-based depth estimation is currently at a very early stage. A method using a stereo event camera has been proposed by Schraml et al. [13]. The two event cameras are mounted on a rotating platform, which results in a simple trajectory that can be described analytically. Their method can compete with state-of-the-art methods but is limited by the need for two event cameras and the given, constrained camera trajectory.

Most recently, a visual odometry solution for the DAVIS has been proposed by Kueng et al. [14]. It makes use of the image frames to extract features, which are then tracked using the event stream.

1.4 Contribution

This thesis comprises five contributions. First, we present a framework to electronically stabilize images and even single events using a gyroscope and an accelerometer. Single-event stabilization can remove the motion blur due to rotation of the camera. The method and results are presented in Chapter 3.

The second contribution is a method for tracking an event camera with respect to a given 3D map (Chapter 4). The approach differs from previous work as it tracks


the camera motion in all 6 DoF and has no constraints on the map (geometric or photometric). Previous methods either assume only a 2D map or a scene given in strong black-and-white contrasts. The proposed algorithm further achieves more than real-time performance and can compute up to 10,000 poses per second.

The third contribution is a purely event-based depth estimation method using a single camera. No prior work has been published on this topic at the time of writing. The method is capable of computing precise depthmaps, but still relies on a sufficient amount of linear motion being present in the camera trajectory. This part is described in Chapter 5.

The fourth method, presented in Chapter 6, concerns image reconstruction in a fixed reference frame given the camera trajectory and the depth of the scene. These two pieces of information are used to reproject every single event into the fixed frame, and the camera pose is taken into account to compute the motion field. This allows us to estimate the gradient, from which the original image intensity can be reconstructed by solving the Poisson equation. So far, image reconstruction has only been demonstrated either for rotational motion only or in the image plane, without taking into account the consistency of the optical flow with a camera motion.

Finally, we fuse the three processes of tracking, depth estimation and image reconstruction into a full visual odometry pipeline. Further details about this process are given in Chapter 7.

The main difference of this thesis with respect to previous approaches is that none of the proposed methods require information other than the events to address the 6-DoF motion and the unconstrained 3D map environment. This should clearly be the ultimate goal for event-based visual odometry, as every pipeline is limited by its weakest component, which in this case are standard intensity frames suffering from motion blur, latency and low dynamic range.


Chapter 2

Event-Based Vision

Event-based vision, or dynamic vision, is inspired by how animals and humans perceive the environment. Instead of describing the world as a series of complete pictures taken at a static frequency, event cameras see changes in the image plane at the time they occur, with very accurate timing (microseconds).

This is in fact a very efficient way to reduce the amount of data from visual sensing. It is used extensively in video compression schemes, where only the changes with respect to keyframes are stored in the resulting file. It also explains why it is almost impossible for humans to see an animal hiding in the woods, but it can be spotted immediately and without any effort when something begins to move. Human perception is very susceptible to motion cues.

Thus, an event from the Dynamic Vision Sensor (DVS) is given by the tuple e_k = ⟨u_k, t_k, ±_k⟩. Each event e_k, which occurred at location u_k at time t_k, relates to a relative brightness change of ±_k C, with the sign defined by the polarity.

Consider an intensity image I(u), where we denote the spatial intensity gradient as g := ∇I and the pixel velocity or apparent motion (also called motion field or optical flow) as v := u̇ ≡ du/dt. Then, one way to interpret the event camera is as a gradient sensor, or more specifically a g · v sensor. This can be understood in terms of the image brightness constancy assumption, or equation of continuity, at a pixel u, which says that the total derivative of the brightness function is zero along the apparent motion trajectories u(t):

dI/dt = ∇I · u̇ + ∂I/∂t = 0.   (2.1)

This is equivalent to

g · v + ∂tI = 0.   (2.2)

Thus, any measurement in the camera frame that corresponds to a change in the image intensity ∂tI is related to the gradient g and the optical flow vector v:

∂tI = −g · v.   (2.3)

This has an interesting consequence: as we describe later in Section 6.2, it is possible to reconstruct the original intensity image I only from knowledge about its gradient


g. Some applications allow us to measure the change of intensity ∂tI and the motion field v, which in turn allows us to estimate the gradient g and thus the original image I.

2.1 Event Frames

Often in this thesis, we do not work on a per-event basis. Instead, we operate on a collection of events, which we call an event frame. We define this frame as an integration over a small period of time ∆t = t1 − t0. From the equation of continuity, we can derive

∂tI = −g · v   (2.4)
⇔ ∂tI (t1 − t0) = −g · v (t1 − t0)   (2.5)
⇔ (I1 − I0) ≈ −g · (u1 − u0),   (2.6)

where the last approximation comes from using only the first-order term in the Taylor series I(t + ∆t) = I(t) + (∂tI)(t) ∆t + O((∆t)²).

From here we arrive at the integrated equation of continuity:

δI = −g · δu,   (2.7)

where we have defined the increments in space, δu := u1 − u0, and in intensity, δI := I1 − I0. Eq. (2.7) states that the difference of brightness in the camera frame during time ∆t corresponds, to first order, to a measurement of the gradient times the displacement of the image point u during ∆t. These two quantities can actually be measured with an event camera, given that we already know the motion of the camera as well as the depth of the scene.

Given a set of events e_k = ⟨u_k, t_k, ±_k⟩ with t_k ∈ [t0, t1], we can formulate the delta image δI in (2.7) as

δI(x) = C Σ_k ±_k δ(x − u_k),   (2.8)

where δ(x) on the right-hand side is the Dirac delta, which is zero everywhere except at the origin.

It is worth noting that if we choose ∆t to be the time span between consecutive events, the formulation here in fact reduces to a per-event style, as the sum in δI simply consists of one single event.

The formulation here is thus comparable to processing each event individually, but with a smoothing over time ∆t applied in order to be more robust to noise and also to reduce the computation time drastically.
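As an illustration of Eq. (2.8), the following minimal Python sketch accumulates one batch of events into a delta image; it assumes events are given as (x, y, t, polarity) tuples with polarity in {−1, +1} and that the contrast threshold C is known, both of which are assumptions made for this example only.

import numpy as np

def delta_image(events, height, width, C=0.1):
    """Accumulate a batch of events into a delta image, cf. Eq. (2.8).

    events: iterable of (x, y, t, polarity) with polarity in {-1, +1}.
    C: contrast threshold of the sensor (assumed known).
    """
    dI = np.zeros((height, width), dtype=np.float32)
    for x, y, t, pol in events:
        dI[y, x] += C * pol  # the Dirac delta becomes a single-pixel increment
    return dI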

2.2 SLAM for Event-Based Vision

For standard (i.e., frame-based) cameras, the process of visual odometry mainly consists of two independent processes running simultaneously. The first one, localization, assumes a perfect 3D map is known and computes the camera pose of new frames with


respect to the given map. The second one, mapping, on the other hand assumes that a perfect set of images with poses is given and computes the depth for parts of the map that have not been triangulated yet.

To summarize, the process looks as follows:

                Input                  Output
  Localization  depth, images       →  trajectory
  Mapping       trajectory, images  →  depth

In contrast, event cameras output events instead of intensity images. Yet, so far all approaches to localization rely on at least one intensity image in the keyframe. The intensity image is then used to extract gradients or edges, which are visible as events.

We do not attempt to break with this tradition. So, in order to arrive at visual odometry using only the event-based stream, we need to reformulate our approach to visual odometry. We propose a three-fold approach, which splits the mapping part into two sub-problems: depth estimation and image reconstruction.

The goal of image reconstruction is to estimate the gradients in a fixed, virtual camera frame, in our case the keyframe. To get there, we need to transform each event from its current camera frame into this fixed frame. Thus, this requires knowledge of the camera pose (rotation and translation) and the corresponding depth for each single event. This suggests that, to avoid mutual dependence between the two sub-problems, depth estimation must rely only on the event stream and the camera pose, as no other information is available at that point.

Our proposed pipeline for event-based visual odometry consists of the following components:

                           Input                        Output
  Localization             depth, image, events      →  trajectory
  Mapping
    Depth Estimation       trajectory, events        →  depth
    Image Reconstruction   trajectory, depth, events →  image

2.2.1 Simultaneous Depth Estimation and Image Reconstruction

As recently proposed by Bardow et al. [12], it is to some degree possible to estimate g and v simultaneously by jointly optimizing an error function in terms of both unknowns. Their method works in the image plane, without taking into account the camera motion or scene depth, and they do not provide a comparison with ground truth, so it is not clear whether v is a good approximation of the optical flow. Nevertheless, one could, in principle, follow a similar optimization approach to jointly estimate the gradient g and the depth (instead of the optical flow) for an event camera moving in a static environment, since in this case the motion field v comprises the information of the camera motion and the depth.

It remains to be seen which approach yields a better solution. This thesis focuses on solving depth estimation and reconstruction independently, but time may show whether


other approaches get to the solution in a more elegant and robust way.


Chapter 3

Image Stabilization

By fusing the measurements of a gyroscope and an accelerometer, one can obtain the orientation of the camera with respect to gravity. These sensors are typically found in an inertial measurement unit (IMU), which is present on the DAVIS. Assuming rotational motion only, we can apply a perspective transformation to the image coordinates to make the image appear as seen from another orientation. Knowing the relative rotation between the images, we can thus warp the images on top of each other. The resulting image stream looks stabilized, as if filmed using a "steadicam".

The method to compute the orientation from the IMU is based on the so-called complementary filter [15]. In the next section, we attempt to give a short outline of this filter. However, we restrict the description to correcting the roll and pitch components of the pose using the gravity vector from the accelerometer measurement. The original formulation also allows for correction of the yaw by using a magnetometer, but this is not available on the DAVIS.

3.1 IMU Complementary Filter

The method implements a quaternion-based complementary filter to solve Wahba's problem. Complementary filters are an alternative to extended Kalman filters and are often preferred for the fusion of gyroscopes and accelerometers, as they are effective and simple to implement.¹

It applies a high-pass filter to remove low-frequency noise from the gyroscope and a low-pass filter to remove high-frequency noise from the accelerometer. As the cut-off frequencies of both filters are the same, they are called "complementary".

A quaternion describing a rotation from the local frame L to the global frame G is given as

GLq = (q_0, q_1, q_2, q_3)^T = ( cos(α/2), e_x sin(α/2), e_y sin(α/2), e_z sin(α/2) )^T   (3.1)

¹ We use the implementation for ROS, which is available at http://wiki.ros.org/imu_complementary_filter


with α the rotation angle around the axis e = (e_x, e_y, e_z)^T. The quaternion inverse and multiplication are given by:

q^{-1} = q* = (q_0, −q_1, −q_2, −q_3)^T,   (3.2)

p ⊗ q = ( p_0 q_0 − p_1 q_1 − p_2 q_2 − p_3 q_3,
          p_0 q_1 + p_1 q_0 + p_2 q_3 − p_3 q_2,
          p_0 q_2 − p_1 q_3 + p_2 q_0 + p_3 q_1,
          p_0 q_3 + p_1 q_2 − p_2 q_1 + p_3 q_0 )^T   (3.3)

After initialization, the following procedure is applied to compute the new state of the system at time t_k, given the previous state at t_{k−1} and the measurements Lω of the angular velocity and La of the acceleration.

3.1.1 Prediction

The derivative of the orientation quaternion can be written in terms of the angular velocity ω as

GLq̇_{ω,t_k} = (1/2) GLq_{t_{k−1}} ⊗ Lω_{q,t_k}   (3.4)

with ω_{q,t_k} = (0, ω_x, ω_y, ω_z)^T the pure quaternion of ω.

The inverse is then

LGq̇_{ω,t_k} = (1/2) (GLq_{t_{k−1}} ⊗ Lω_{q,t_k})*   (3.5)
            = −(1/2) Lω_{q,t_k} ⊗ LGq_{t_{k−1}}   (3.6)

and the prediction results from integration over the interval ∆t = t_k − t_{k−1}:

LGq_{ω,t_k} = LGq_{t_{k−1}} + LGq̇_{ω,t_k} ∆t   (3.7)

3.1.2 Correction

After the prediction of the pose using the angular velocity, the roll and pitch components of the orientation are updated using

GLq = GLq_ω ⊗ ∆q   (3.8)

Denoting the 3 × 3 rotation matrix of a quaternion q by R(q), we have the following constraint:

R(GLq_ω) La =: Gg_p = R(∆q_acc) Gg   (3.9)

which relates the predicted gravity Gg_p with the real gravity vector Gg through the delta quaternion ∆q_acc, where

Gg_p = (g_x, g_y, g_z)^T   (3.10)
Gg = (0, 0, 1)^T   (3.11)

The system has an infinite number of solutions, as the gravity vector gives no constraint on the yaw component. Setting ∆q_{acc,3} = 0, it has the closed-form solution

∆q_acc = ( √((g_z + 1)/2),  −g_y / √(2(g_z + 1)),  g_x / √(2(g_z + 1)),  0 )^T   (3.12)

The high-pass filter is then implemented by interpolation between the quaternion ∆q_acc and the unit quaternion q_I. The degree of interpolation is defined by the gain factor α.

As the deviation from the unit quaternion is generally very small, simple Linear intERPolation (LERP) is used for performance reasons.

∆q̄_acc = (1 − α) q_I + α ∆q_acc   (3.13)

∆q̂_acc = ∆q̄_acc / ||∆q̄_acc||   (3.14)

Only if the deviation is large, the more complex Spherical Linear intERPolation (SLERP) is applied.

∆q̂_acc = sin[(1 − α)Ω] / sin(Ω) · q_I + sin(αΩ) / sin(Ω) · ∆q_acc   (3.15)

cos(Ω) = p · q = p_0 q_0 + p_1 q_1 + p_2 q_2 + p_3 q_3   (3.16)
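A minimal Python sketch of this interpolation step (Eqs. (3.13)–(3.16)), assuming quaternions stored as numpy arrays in (w, x, y, z) order and a small-deviation threshold for switching between LERP and SLERP; the threshold value and function name are illustrative, not taken from [15].

import numpy as np

def scale_delta_quaternion(dq_acc, alpha, eps=0.9):
    """Interpolate between the identity quaternion and dq_acc.

    dq_acc: accelerometer correction quaternion (w, x, y, z); alpha: filter gain.
    If the real part is close to 1 the deviation is small and LERP suffices,
    otherwise SLERP is used.
    """
    q_identity = np.array([1.0, 0.0, 0.0, 0.0])
    if dq_acc[0] > eps:
        # LERP (Eq. 3.13) followed by renormalization (Eq. 3.14)
        dq = (1.0 - alpha) * q_identity + alpha * dq_acc
    else:
        # SLERP (Eq. 3.15), with cos(Omega) from Eq. (3.16)
        omega = np.arccos(np.clip(np.dot(q_identity, dq_acc), -1.0, 1.0))
        dq = (np.sin((1.0 - alpha) * omega) * q_identity
              + np.sin(alpha * omega) * dq_acc) / np.sin(omega)
    return dq / np.linalg.norm(dq)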

The two interpolation methods are visualized in Fig. 3.1.

Figure 3.1: Visualization of the linear (LERP) and spherical linear (SLERP) interpolation methods for quaternion interpolation in 2D. LERP does not conserve the unit norm. (Image from [15])

3.1.3 Online Bias Estimation

While the low-pass filter is applied to the accelerometer measurements, the high-pass filter is applied to the gyroscope measurements in terms of an online bias estimation.

The system is tested for a steady state by comparing the linear acceleration, the angular velocity and the angular acceleration against some thresholds. If these result in a steady state, the angular velocity bias is updated. A high-pass filter is applied during the bias update using

ω_bias ← α_bias (ω − ω_bias).   (3.17)


3.2 Image Stabilization

First, we explain the stabilization of complete images as opposed to single events.² For this, we use the orientation computed from the IMU and keep two reference states. One is the current pose of the camera, the other the steady pose, which is the current pose with a low-pass filter applied.

To stabilize the image, we use the transformation from the current pose to the steady pose and apply a perspective warp to transform the image. With each new image k, the steady pose q_s is updated with the current orientation quaternion q_k in a low-pass filter fashion

q_s ← Slerp(q_s, q_k, α) = (q_s q_k^{-1})^α q_s,   (3.18)

where α ∈ [0, 1] is the damping factor, usually kept at a value of 0.1.

We then apply a perspective warp on image k, using the matrix

M = K R_{s,k} K^{-1},   (3.19)

where K is the intrinsic camera matrix and R_{i,j} is the matrix describing the rotation from frame j to frame i.
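A minimal Python sketch of the warp in Eq. (3.19), assuming OpenCV is available and that the 3×3 rotation matrix R_sk from the current to the steady pose has already been computed from the filter output; names are illustrative.

import cv2
import numpy as np

def stabilize_frame(image, K, R_sk):
    """Warp an image from the current pose into the steady pose, Eq. (3.19).

    K: 3x3 intrinsic camera matrix; R_sk: rotation from current to steady pose.
    """
    M = K @ R_sk @ np.linalg.inv(K)  # rotation-only homography
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, M, (w, h))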

To make the output appear more natural, we align the steady pose q_s with gravity. This means we fix the roll component of the orientation. Thanks to the complementary filter, which uses the gravity vector as a reference to correct the roll and pitch components, we achieve high accuracy in the gravity alignment.

In Fig. 3.2 we demonstrate the results of the image stabilization. The left column shows the raw output from the camera. The right column shows the images after warping. Although the camera undergoes heavy rotation, the scene on the right appears almost static. It is also clearly visible that every image on the right is perfectly aligned to gravity. To get a better feeling for the stabilization, we recommend checking out our demonstration video on YouTube³ [16].

² Since no loop over image coordinates is needed, the project can be implemented in Python and still achieve real-time performance.

³ https://youtu.be/U61INjUiMTU


Figure 3.2: Camera frames from the DAVIS before (a) and after (b) stabilization and alignment to gravity using the IMU.


3.3 Event Stabilization

After image stabilization, we now describe how the same approach can be applied to every single event generated by the DAVIS camera. The general concept has already been demonstrated in [17]. Again, we use the same model with a current and a steady pose, but instead of transforming complete images, we now transform every single event.⁴

If we visualize a set of events by showing the accumulated change in brightness, we usually see the same motion blur artifacts that are visible on standard cameras. The interesting thing about event stabilization is that it is possible to practically remove that particular motion blur if it results from the rotation of the camera. This is not possible on a frame-based camera, where an individual pixel has no additional timing information. In contrast, in event-based vision, each event has its own timestamp and thus allows us to apply a different rotation on a per-pixel basis.
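A minimal Python sketch of per-event stabilization under these assumptions: events are (x, y, t, polarity) tuples and a helper rotation_at(t) (hypothetical, e.g. interpolated from the complementary filter output) returns the 3×3 rotation from the camera pose at time t to the steady pose.

import numpy as np

def stabilize_event(event, K, K_inv, rotation_at):
    """Rotate a single event into the steady frame using its own timestamp."""
    x, y, t, pol = event
    ray = K_inv @ np.array([x, y, 1.0])      # back-project to a viewing ray
    ray_stab = rotation_at(t) @ ray          # per-event rotation
    u = K @ ray_stab                         # reproject into the steady frame
    return u[0] / u[2], u[1] / u[2], t, pol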

As for the image stabilization, we present some samples of the event stabilization in Fig. 3.3, showing the raw output on the left and the stabilized events on the right. The images show drastically the effectiveness of the stabilization in removing motion blur. Again, we provide a video⁵ [16] to demonstrate our approach in a real scene.

⁴ For this purpose we have to switch from Python to C++ in order to achieve real-time performance.
⁵ https://youtu.be/tqONIVgxAvg


Figure 3.3: Visualization of the event stream before (a) and after (b) stabilization.


3.4 Conclusion

The experiments have shown that even a simple approach using measurements from a gyroscope and an accelerometer can achieve impressive stabilization of camera images in software. The effect is even more dramatic for the event stream. By applying a correction on a per-event basis, we can practically overcome the blur caused by rotation of the camera. This is very important, as most of the motion blur is in fact caused by rotation. Thus, removing that part will dramatically increase the performance of subsequent processes such as motion tracking or depth estimation.

Extending the approach to also account for translational motion is, however, not trivial. First, we need a measurement of the translation. This can be computed from the accelerometer, but the result is not very precise and heavily affected by drift, as the measurements have to be integrated twice in order to get from acceleration to position. In addition, we need to know the depth of the scene, since a 3D point in the camera frame moves fast if it is close to the camera, or very slowly if it lies far away from it (e.g., at infinity, the sky or very distant background).

3.4.1 The Rotation Camera

Taking into account that image reconstruction from gradients is possible (see details in Section 6.2), we propose to build an intensity camera from the DVS that can "see" the world as long as it is being rotated.

Using the orientation at the beginning and the end of an event frame, we can compute the motion field v. Correcting the events for that rotation, we get an estimate of ∂I/∂t in a fixed reference frame, e.g., the frame at the beginning of the event stream. Using the equation of continuity g · v = −∂I/∂t, we can directly solve for the gradient g and use the Poisson image reconstruction method to recover the original intensity image. As this approach assumes only rotational motion, the results obviously will not be very usable when the camera undergoes translational motion.


Chapter 4

Tracking

We start by describing the tracking method of our odometry pipeline. As described earlier in Section 2.2, the assumption for this problem is that a depthmap as well as an intensity image are available in a reference frame. Since the event camera only captures gradients in the scene, we use the gradients of the intensity image to generate a mask for the depthmap. Each valid point in this mask is then converted into a 3D map point.

These 3D points are selected in a semi-dense manner, comparable to LSD-SLAM [18]. We compute the gradients of the given intensity image and operate on the magnitude of the gradient, ||g||_2. A set of points in this gradient magnitude image is then selected based on a fixed minimal gradient value. This is very similar to edge detection, just without the subsequent refinement to make edges thin. The method is described in detail in Section 4.4.

We begin this chapter with a general description of the Lucas-Kanade method, which is the method we use to align images using a given warp function W(u; p). Then follows a description of the applied warp in Section 4.2. As we want to track the motion of the camera in all six degrees of freedom, we use the so-called 3D rigid-body warp, which is mathematically described by SE(3), the special Euclidean group in three dimensions. As the Lucas-Kanade tracker has so far only been used on classical frame-based cameras, we introduce a novel method for image alignment, random sparse tracking, in Section 4.3. It reduces the complexity of the Lucas-Kanade method by some orders of magnitude and is thus able to process up to 10,000 frames per second from an event camera. The final three sections handle the details of the implementation, namely the extraction of 3D points from the given intensity image (Section 4.4), the formation of the reference frame (Section 4.5) and the event frame (Section 4.6).


4.1 Inverse Compositional Lucas-Kanade Method

We use the inverse compositional formulation of the Lucas-Kanade method [19], which minimizes the following photometric error:

Σ_u ( T(W(u; ∆p)) − I(W(u; p)) )²,   (4.1)

where the sum is over all candidate pixel coordinates u in the image.

The minimization of the error term is achieved by the Gauss-Newton method. The first-order Taylor approximation of the residual (4.1) gives the approximate photometric error

Σ_u ( T(W(u; 0)) + ∇T (∂W/∂p) ∆p − I(W(u; p)) )² = Σ_u ( J ∆p − δI )²,   (4.2)

where we defined the residual δI and the Jacobian J as follows, using the fact that the warp at the origin, W(u; 0) = u, is equal to the identity:

δI := I(W(u; p)) − T(u),   (4.3)
J := ∇T (∂W/∂p).   (4.4)

The approximation (4.2) is quadratic in ∆p; its minimizer satisfies the necessary optimality condition ∂/∂∆p = 0, i.e.,

Σ_u ( J ∆p − δI )^T J = 0   (4.5)
⇔ ( Σ_u J^T J ) ∆p = Σ_u J^T δI   (4.6)
⇔ ( Σ_u H ) ∆p = Σ_u J^T δI,   (4.7)

where H := J^T J is the Gauss-Newton approximation to the Hessian of the photometric error corresponding to point u.

The inverse compositional formulation of the Lucas-Kanade method has the advantage that the quantities J and H only depend on the template image T and the warp W. Thus, once computed for the reference image, they can be used to align all following images that are tracked with respect to the current reference frame.

So, for each iteration, the steps are as follows:

1. Calculate the residuals δI for all u using I(W(u; p)).
2. Solve the linear system Ax = b, with A = Σ_u H, x = ∆p and b = Σ_u J^T δI.
3. Update the warp W(u; p) ← W(u; p) ∘ W(u; ∆p)^{-1}.
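A compact Python sketch of one such Gauss-Newton step (steps 1–2), assuming the per-pixel Jacobians have already been stacked into an N×6 matrix and the residuals into a length-N vector; the warp update of step 3 depends on the chosen parameterization and is only indicated as a comment.

import numpy as np

def ic_lk_step(J, delta_I):
    """One inverse compositional Gauss-Newton step, Eqs. (4.5)-(4.7).

    J: (N, 6) stacked Jacobians, one row per candidate pixel u.
    delta_I: (N,) residuals I(W(u; p)) - T(u).
    Returns the parameter update Delta p.
    """
    A = J.T @ J                    # sum_u J^T J (approximate Hessian)
    b = J.T @ delta_I              # sum_u J^T delta_I
    return np.linalg.solve(A, b)

# Step 3 (inverse compositional update), applied by the caller:
#   W(u; p) <- W(u; p) o W(u; Delta p)^(-1)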


To calculate the intensity values at the subpixel coordinates W(u; p) in the image I, we use bilinear interpolation. More details about bilinear interpolation can be found in Appendix A.2.

4.2 3D Rigid-Body Warp

A 3D point kp in the camera frame k is projected into image coordinates through

u = π(kp)   (4.8)

with the standard projection π defined as

π(p) = K p / p_z   (4.9)
π^{-1}(u, d_u) = d_u K^{-1} (u_x, u_y, 1)^T   (4.10)

where K is the intrinsic camera matrix [20], defined as

K = ( f_x  0    c_x
      0    f_y  c_y
      0    0    1 )   (4.11)

The camera position with respect to the previous frame (in our case, the template T) is described by the rigid-body transformation T_{k,k−1} ∈ SE(3),

T_{k,k−1}(p) = R p + t.   (4.12)

For the parameter update, we use the twist coordinates of the rigid-body transformation, which form the Lie algebra se(3). They map to SE(3) as [21]

ξ = (v, ω)^T   (4.13)
T(ξ) = exp(ξ̂)   (4.14)

where the hat operator ˆ represents the lifting operator from twist coordinates onto the Lie algebra se(3).

The rigid-body warp is then given as

W(u, d_u; ξ) = π( T(ξ) · π^{-1}(u, d_u) ).   (4.15)

This shows that for the error term in equation (4.1) we can only consider those u that also have an associated depth estimate d_u to compute an update to the transformation.

Writing (4.1) in more detail, we seek the transformation that satisfies

ξ* = argmin_ξ Σ_u ( I_k( π(T_{k,k−1} · p) ) − I_{k−1}( π(T(ξ) · p) ) )²,   (4.16)

where p = π^{-1}(u, d_u) is the 3D point obtained as the back-projection of u. Here, we used a different notation with respect to (4.1): ξ ≡ ∆p; I_{k−1} ≡ T; I_k ≡ I; T_{k,k−1} ≡ p.


To compute the Jacobian J, we need the image gradient and the derivative of the warping function. The 3D rigid-body warp is characterized by the interaction matrix [22]. With x = (u, v) := K^{-1} u and Z := d_u, it is given as [23]

∂W/∂p = B = ( −1/Z   0     u/Z   uv       −(1 + u²)   v
               0     −1/Z   v/Z   1 + v²   −uv        −u ).   (4.17)

It relates the twist coordinates ξ to the temporal derivative of the image point x by

ẋ = B ξ   (4.18)
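A minimal Python sketch of the interaction matrix of Eq. (4.17) for a single point, assuming the normalized coordinates (u, v) = K^{-1} u and the depth Z are given:

import numpy as np

def interaction_matrix(u, v, Z):
    """2x6 interaction matrix B of Eq. (4.17), relating the twist xi to x_dot."""
    return np.array([
        [-1.0 / Z, 0.0, u / Z, u * v, -(1.0 + u * u), v],
        [0.0, -1.0 / Z, v / Z, 1.0 + v * v, -u * v, -u],
    ])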

After each iteration we update the current transformation in the warp W with the optimal update step:

T_{k,k−1} ← T_{k,k−1} · ( T(ξ*) )^{-1}.   (4.19)

4.3 Random Sparse Tracking

The Lucas-Kanade method has only recently become popular for visual odometry. For standard cameras operating at 30 Hz, a GPU is needed to achieve real-time tracking [24].

The complexity of each iteration of the inverse compositional method is O(nN + n³) [19], where n denotes the number of parameters of the warp and N is the number of pixels in the image, i.e. the number of elements in u. As the number of warp parameters is fixed to 6 for the 3D rigid-body warp, we can only try to decrease the number of relevant pixels used for tracking.

The following methods have been proposed recently to limit the number of active pixels in an image.

Dense: For dense methods, all pixels of the image are evaluated in each iteration. The first real-time implementation of this method was [24].

Semi-dense: Only points with a minimum absolute gradient are considered for tracking. This results in an edge tracking as implemented in LSD-SLAM [18].

Sparse: Features are extracted in the reference frame and only pixels in the neighborhood of these features define the candidate pixels. An example of this is SVO [25].

Random sparse: We take a semi-dense map as reference, but only select a random set of points for each iteration.

This has the advantage of spreading the candidate points naturally across the image (as long as gradients are present) while reducing the complexity of each iteration drastically. While sparse methods like SVO can achieve a performance of about 300 frames per second, empirical evaluation has shown that a small number of 50–500


samples can lead to reasonable tracking results and allows processing up to 10,000 frames per second.

The following table summarizes the number N of pixels used for pose estimation in the described methods. The values are based on a sensor with a resolution similar to that of the DVS camera (200 × 200).

                  N
  Dense           40,000
  Semi-dense      10,000
  Sparse          2,000
  Random sparse   ~300

Figure 4.1 gives a visual comparison between the four mentioned methods. The relevant pixels are colored in green. While the dense method uses the whole image, in semi-dense methods this is reduced to edges, i.e. pixels with a high gradient. Sparse methods use patches of a fixed size around good features found by a feature extraction algorithm. The random sparse method selects sample points from the set of pixels of the semi-dense method.

This shows that random sparse tracking is up to 100 times faster than the classical dense method for the Lucas-Kanade algorithm.

There is one drawback, however. In general, one can precompute the matrix A = Σ_u J^T J. Since we randomly select a different set of points u in each iteration, all we can do is cache the matrices J^T J and then construct A from the selected points.
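A minimal Python sketch of this caching scheme, assuming the per-pixel matrices J^T J and Jacobians have been precomputed for all N candidate pixels and that the residuals for the current warp are available; array shapes and names are assumptions made for the example.

import numpy as np

def random_sparse_step(JtJ_cache, J_cache, residuals, n_samples=300, rng=None):
    """Build and solve the Gauss-Newton system from a random subset of pixels.

    JtJ_cache: (N, 6, 6) cached J^T J per candidate pixel.
    J_cache:   (N, 6)    cached Jacobians.
    residuals: (N,)      photometric residuals delta_I for the current warp.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(residuals), size=n_samples, replace=False)
    A = JtJ_cache[idx].sum(axis=0)                          # sum of cached J^T J
    b = (J_cache[idx] * residuals[idx, None]).sum(axis=0)   # sum of J^T delta_I
    return np.linalg.solve(A, b)                            # update Delta p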

Another difficulty arises when defining a proper convergence criterion. Usually, one sets a lower threshold for both the parameter update ∆p and the residuals δI. However, for a random set of points, we should not trust these values and thus simply limit the algorithm to a fixed number of iterations.

4.4 Map Extraction

The input to the tracking algorithm is a dense intensity image. However, only points with a strong gradient trigger a significant number of events and are thus useful for pose estimation. To remove the parts of the image that are likely to give no information for tracking, we evaluate the gradient magnitude of the intensity image, inspired by LSD-SLAM [18].

Given a gradient map ∇I(u), we compute its magnitude image ||∇I(u)||. We then normalize and threshold the absolute gradient to extract the relevant map points. Thus, we need to define a threshold gradient g_min ∈ [0, 1]. A lower value results in a denser map, whereas a higher value makes the map more sparse.

Of course, the algorithm also works if only the absolute gradients are provided; the actual intensity image is not necessary. This fact is used later in the fused visual odometry process. Each point with a gradient higher than the threshold is then converted into a 3D point by taking the corresponding depth value into account.
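A minimal Python sketch of the map extraction step, assuming an intensity image, a dense depthmap and the inverse intrinsic matrix are given; the gradient operator and threshold value are illustrative choices.

import numpy as np

def extract_map_points(I, depth, K_inv, g_min=0.2):
    """Select semi-dense map points from an intensity image (Section 4.4).

    I: (H, W) intensity image; depth: (H, W) depthmap; g_min: threshold in [0, 1].
    Returns an (M, 3) array of 3D points in the camera frame.
    """
    gy, gx = np.gradient(I.astype(np.float32))
    mag = np.hypot(gx, gy)
    mag /= mag.max() + 1e-12                    # normalize to [0, 1]
    ys, xs = np.nonzero(mag > g_min)            # candidate pixels
    rays = K_inv @ np.vstack([xs, ys, np.ones_like(xs)])
    return (rays * depth[ys, xs]).T             # back-project with depth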

In Fig. 4.2 we show the three steps to select the candidate map points in an image. Starting with the original intensity image, we compute the gradient magnitude


Figure 4.1: Comparison of the different image alignment methods: (a) dense, (b) semi-dense, (c) sparse, (d) random sparse. Green shows the active image points used for evaluating the photometric error term in (4.1). Random sparse tracking uses more than 100 times fewer pixels than the dense method.

||∇I(u)||. Then, we threshold the resulting gradient to select the candidate points. Lastly, each candidate point is turned into a 3D point using the associated depth value. The resulting map is shown in Fig. 4.3.

Figure 4.2: Map extraction steps for the tracking pipeline. (a) The original intensity image I. (b) The gradient magnitude ||∇I||. (c) The thresholded gradients, which show the points to be used in the 3D map.

4.5 Reference Frame

We define the reference frame, or in terms of the Lucas-Kanade framework the template image T(u), as a projection of the 3D map into the current camera pose.

For this process we use two images, the template T(u) and a projected depthmap d_u. Both images are set to zero in the beginning. We then project each map point into the camera frame. The corresponding pixel in T is set to 1, whereas the pixel in


Figure 4.3: Three different views of the resulting map from the mask in Fig. 4.2c. Each point in the mask is converted into a 3D point using the associated depth value. The color encodes the distance as seen from the camera frame. Red: close, blue: far.

the depthmap is set to the depth value of the projected point. This means that if two map points project to the same location, we simply discard the previous depth value. In the future, this could be improved by handling occlusions correctly. In the end we have a sparse depthmap and a binary projection of the map points in T.

To increase the convergence radius, and thus make it possible to apply the Lucas-Kanade method at all, we blur the reference frame using a Gaussian filter. Blurring the depthmap may sound complicated, but it can be implemented very easily since we have the binary template T, which also serves as a mask. We just apply the Gaussian blur to both T(u) and d_u and then obtain a blurred depthmap by assigning

d_u ← d_u / T(u)   (4.20)
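A minimal Python sketch of the blurred reference frame, assuming OpenCV; the kernel size and sigma are illustrative parameters. Dividing the blurred depthmap by the blurred binary template implements Eq. (4.20).

import cv2
import numpy as np

def blur_reference(T, depth, ksize=(15, 15), sigma=5.0):
    """Blur the binary template T and the sparse depthmap, Eq. (4.20)."""
    T_blur = cv2.GaussianBlur(T.astype(np.float32), ksize, sigma)
    d_blur = cv2.GaussianBlur(depth.astype(np.float32), ksize, sigma)
    d_out = np.zeros_like(d_blur)
    valid = T_blur > 1e-6
    d_out[valid] = d_blur[valid] / T_blur[valid]   # d_u <- d_u / T(u)
    return T_blur, d_out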

We show the two steps in Fig. 4.4. The left column shows the template T(u), whereas the right column shows the depthmap d_u. The top and bottom rows show the projected map before and after the Gaussian blur, respectively. The images show that, by applying the Gaussian blur, the visible area of the template shrinks, whereas the depthmap grows. This increases the number of pixels with a valid depth value and thus leads to a larger convergence radius.

4.6 Event Frame

The event frame, or the image I(u), is generated by accumulating a set of events. We choose a fixed event size criterion to determine the size of each frame. To find a reasonable number of events to make up a frame, we define the event-map ratio r_ev,map, typically in the range between 0.2 and 0.4. If n_map is the number of non-zero pixels in the blurred reference image, the number of events per frame is defined as

n_ev = r_ev,map · n_map   (4.21)

Choosing a fixed event size has some interesting consequences when comparing with a frame-based camera. Instead of a fixed frame rate, we now have an adaptive frame rate, depending on the amount of intensity change going on in the camera frame. In a static environment, and accounting for the amount of gradient in the scene, as we do by using the event-map ratio, this only depends on the camera motion. Thus, if we keep


Figure 4.4: The two images (columns) and two steps (rows) for map projection. (a) The binary image T before blurring. (b) The corresponding depthmap for T. (c) Template T after blurring. (d) Blurred depthmap. The color scale ranges from close to far.

Figure 4.5: Two sample frames showing the binary representation of the event stream, which serves as the new image I.


the camera still, we generate no frames at all, but if we move the camera fast, many frames are produced.

From the given events we create a binary image. Starting with a zero-valued image, we set each pixel at which an event occurs to 1. Two example images of the event frame are shown in Fig. 4.5.
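A minimal Python sketch that groups the event stream into binary frames using the adaptive frame size of Eq. (4.21); the event tuple layout and the default ratio are assumptions made for the example.

import numpy as np

def event_frames(events, n_map, shape, r_ev_map=0.3):
    """Yield binary event frames of n_ev = r_ev_map * n_map events each."""
    n_ev = max(1, int(r_ev_map * n_map))
    frame, count = np.zeros(shape, dtype=np.uint8), 0
    for x, y, t, pol in events:
        frame[y, x] = 1                  # binary accumulation
        count += 1
        if count == n_ev:                # frame complete: emit and reset
            yield frame
            frame, count = np.zeros(shape, dtype=np.uint8), 0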

At this point it is important to compare the blurred reference image, which makes up the template T(u) in the error term in (4.1), with the event frame of the new image I. We show these images side by side in Fig. 4.6, and it is easy to see that they form a good input for an image alignment technique and thus motivate the use of the Lucas-Kanade method for pose estimation with an event-based camera.

(a) Template T (b) Image I

Figure 4.6: Comparing the template T and the image I in the Lucas-Kanade framework.


Chapter 5

Depth Estimation

Mapping turns out to be the most challenging of the three parts. Many standard methods use a reference image and track points in every subsequent image. The position of the same point in two different camera frames gives a 3D measurement by triangulation. For event-based mapping, however, there are no images.

The work in this thesis uses a proof-of-concept approach by again dividing events into frames, one of which is used as a reference “image”. In this reference image, we select keypoints in areas where many events occurred, and we use these keypoints to estimate the depth.

Referring to the integrated equation of continuity (2.7), we denote the k-th event frame by δI_k, with δI_r being the reference frame. Through the relation δI = −g · δu, we interpret our approach as gradient matching.

To reduce the complexity, we select a set of keypoints in the reference frame. These are then matched with the gradients of all other frames by searching along the epipolar line using a one-dimensional Lucas-Kanade method. Since we know the camera pose for each event, and since the rotational part of an event's motion does not depend on the depth, we “derotate” every event in the event frame.

To incorporate many depth measurements for each keypoint, we use the Bayesian depth filters from [26], a mathematical model of a probabilistic depth sensor. One of its advantages is that it provides a depth estimate at every point in time, which can be used as a starting point for the gradient matching along the epipolar line.

5.1 Epipolar Line Search

To match the gradients along the epipolar line, we again use the Lucas-Kanade method introduced for tracking in Chapter 4.

As a reminder, the error term is given as

$$\sum_{u} \Big( T\big(W(u;\Delta p)\big) - I\big(W(u;p)\big) \Big)^2. \qquad (5.1)$$


The warping function W is now reduced to a one-dimensional line search given as

$$W(u; p) = \begin{pmatrix} u_{init,x} + p\cos\theta \\ u_{init,y} + p\sin\theta \end{pmatrix}, \qquad (5.2)$$

$$u_{init} := \pi\big(T_{k,r} \cdot \pi^{-1}(u, d_u)\big), \qquad (5.3)$$

where θ is a fixed parameter defining the epipolar line and u_init defines the starting point of the epipolar line search. It is obtained from the projection of the keypoint u from the reference frame r into the current frame k using the current depth estimate from the Bayesian depth filters.

The Jacobian of the warp is then simply

$$\frac{\partial W}{\partial p} = \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix}. \qquad (5.4)$$

This also defines the angle of the epipolar line in the current camera frame.
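To illustrate the one-dimensional search, the following Python/NumPy sketch performs a forward-additive Gauss-Newton iteration along the epipolar line over a small patch of pixels around the keypoint. The thesis uses the inverse compositional formulation; the function and variable names here are illustrative only.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def sample(img, pts):
    """Bilinear interpolation at sub-pixel points; pts is (N, 2) as (x, y)."""
    return map_coordinates(img, [pts[:, 1], pts[:, 0]], order=1, mode='nearest')

def epipolar_lk_1d(template, image, grad_x, grad_y, patch,
                   u_ref, u_init, theta, p0=0.0, n_iter=20, tol=1e-3):
    """1-D Lucas-Kanade line search along the epipolar line (Eqs. 5.1-5.4)."""
    d = np.array([np.cos(theta), np.sin(theta)])   # warp Jacobian, Eq. (5.4)
    t = sample(template, u_ref + patch)            # reference patch values
    p = p0
    for _ in range(n_iter):
        pts = u_init + p * d + patch               # warped patch, Eq. (5.2)
        i = sample(image, pts)
        g = sample(grad_x, pts) * d[0] + sample(grad_y, pts) * d[1]
        dp = np.sum(g * (t - i)) / (np.sum(g * g) + 1e-12)  # Gauss-Newton step
        p += dp
        if abs(dp) < tol:
            break
    return p
```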

5.2 Linear Triangulation

Given a 3D point X seen from two different viewpoints, i.e. after we have matched the gradients along the epipolar line, we can recover its position by triangulating the rays originating from the two camera frames. We solve this with the direct linear triangulation method described in [20].

Let f_1 and f_2 be the image coordinates of the observed point, given by f_i = K^{-1} u_i. Further, let R and t denote the rotation and translation between the frames, respectively. Then we can define the two 3 × 4 projection matrices

$$P_1 = [\,\mathbf{1}_3 \mid \mathbf{0}\,] \qquad (5.5)$$

$$P_2 = [\,R \mid t\,] \qquad (5.6)$$

which are related to f_i and X by

$$f_i = P_i X \qquad (5.7)$$

Let p_i^{(j)} denote the j-th row of the projection matrix P_i. Then we find

$$f_i \times (P_i X) = 0 \qquad (5.8)$$

$$\Leftrightarrow \quad f_{i,x}\,(p_i^{(3)} X) - (p_i^{(1)} X) = 0 \qquad (5.9)$$

$$f_{i,y}\,(p_i^{(3)} X) - (p_i^{(2)} X) = 0 \qquad (5.10)$$

$$f_{i,x}\,(p_i^{(2)} X) - f_{i,y}\,(p_i^{(1)} X) = 0 \qquad (5.11)$$

Using equations (5.9) and (5.10) for both views gives us a linear system AX = 0 with A given as

$$A = \begin{pmatrix} f_{1,x}\,p_1^{(3)} - p_1^{(1)} \\ f_{1,y}\,p_1^{(3)} - p_1^{(2)} \\ f_{2,x}\,p_2^{(3)} - p_2^{(1)} \\ f_{2,y}\,p_2^{(3)} - p_2^{(2)} \end{pmatrix} \qquad (5.12)$$


We then seek the X that minimizes ||AX|| subject to ||X|| = 1. The problem can be solved with the singular value decomposition [20]. With A = U D V^T, the solution X = (w_x, w_y, w_z, w)^T is the last column of V. The depth of the triangulated point in the first frame is thus given by

$$z = \frac{V_{3,4}}{V_{4,4}} \qquad (5.13)$$

As the depth estimate improves with every new measurement, the starting point u_init for the epipolar line search gets closer to the actual minimum, resulting in fewer iterations of the Lucas-Kanade method. The precision is higher than in usual epipolar line matching methods, as we use bilinear interpolation (see Appendix A.1) to obtain sub-pixel accuracy.
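A compact NumPy sketch of this triangulation is given below; it is illustrative only, and f1, f2 are assumed to be the normalized coordinates as 3-vectors with a trailing 1:

```python
import numpy as np

def triangulate_dlt(f1, f2, R, t):
    """Direct linear triangulation of one point (Eqs. 5.5-5.13)."""
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])   # Eq. (5.5)
    P2 = np.hstack([R, t.reshape(3, 1)])            # Eq. (5.6)
    A = np.vstack([
        f1[0] * P1[2] - P1[0],
        f1[1] * P1[2] - P1[1],
        f2[0] * P2[2] - P2[0],
        f2[1] * P2[2] - P2[1],
    ])                                              # Eq. (5.12)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]               # homogeneous solution, last column of V
    depth = X[2] / X[3]      # depth in the first frame, Eq. (5.13)
    return X[:3] / X[3], depth
```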

5.3 Bayesian Depth Filter

We initialize each keypoint with the average depth of the scene. Any new measurement from a frame δI_k is then used to update the estimate in a Bayesian filter framework, as already applied in [26, 27, 25].

The approach divides the measurements into good measurements and outliers. A good measurement is distributed around the mean value of a Gaussian distribution N, whereas an outlier comes from a uniform distribution U between [d_min, d_max]. This results in the following mathematical description:

$$p(d_i^k \mid d_i, \rho_i) = \rho_i\, \mathcal{N}(d_i^k \mid d_i, \tau_i^2) + (1 - \rho_i)\, \mathcal{U}(d_i^k \mid d_i^{min}, d_i^{max}) \qquad (5.14)$$

with ρ_i the inlier probability and τ_i^2 the variance of a good measurement. The value τ_i^2 arises from assuming a fixed error in the image plane.
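For illustration, the mixture likelihood of Eq. (5.14) can be written down directly (Python/SciPy sketch with hypothetical names):

```python
import numpy as np
from scipy.stats import norm

def measurement_likelihood(d_meas, d_est, rho, tau, d_min, d_max):
    """Inlier/outlier mixture likelihood of a depth measurement, Eq. (5.14)."""
    inlier = rho * norm.pdf(d_meas, loc=d_est, scale=tau)
    outlier = (1.0 - rho) / (d_max - d_min)     # uniform outlier density
    return inlier + outlier
```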

We illustrate the process in Fig. 5.1. Depth measurements are represented as the inverse of the actual depth value in order to also represent points that lie at infinity. The non-trivial recursive update formulas for the depth filter are given in the supplementary material of [26].

The convergence criterion is defined as a lower threshold on the variance of the Gaussian distribution. However, as some outliers remain, we run a “radius outlier removal” on the resulting point cloud: each point that does not have a certain number of other points in its neighborhood is removed from the map.
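This outlier removal step could be sketched as follows (Python/SciPy; radius and neighbor count are illustrative parameters, not the values used in the thesis):

```python
import numpy as np
from scipy.spatial import cKDTree

def radius_outlier_removal(points, radius=0.2, min_neighbors=5):
    """Keep only 3D points with at least `min_neighbors` other points
    within `radius`; isolated points are treated as outliers."""
    tree = cKDTree(points)
    counts = np.array([len(tree.query_ball_point(p, radius)) - 1
                       for p in points])      # -1 excludes the point itself
    return points[counts >= min_neighbors]
```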

5.4 Event Frame

To produce better gradient images from the events, we remove the rotational part of the motion, as it is independent of the actual depth of the scene. Denoting the rotation of the event k with respect to δI by R_k, the delta image is thus defined as

$$\delta I(x) = C \sum_k \pm_k\, \delta\!\left[x - \pi\!\big(R_k^{-1}\cdot \pi^{-1}(u_k, 1)\big)\right]. \qquad (5.15)$$



Figure 5.1: For each new event frame we triangulate a new 3D point with depth d_i^k. We update the current estimate d_i with the new measurement in the Bayesian filter framework. (Image from [25])

The depth d_u is set to one, as it plays no role when transforming the point u with a pure rotation R. Thus, the nature of the event stream allows us to completely remove the rotational part of the camera motion, leaving us with pure translational motion only.

With the rotated event position, we increment the corresponding point in the image by one. We could use the contrast threshold C of the DVS camera instead, but as we normalize each image in the end, this value actually has no effect. As the reprojection usually results in a sub-pixel location for each event, we use the bilinear drawing method from Appendix A.2 to improve the precision of the derotated event frame.
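A minimal sketch of the derotation of a single event (Python/NumPy; K is the camera matrix and R_k the event's rotation, and the names are illustrative, not the thesis API) is:

```python
import numpy as np

def derotate_event(u, K, R_k):
    """Map an event pixel u into the rotation-compensated frame, Eq. (5.15).
    The depth is set to 1, since a pure rotation is depth-independent."""
    bearing = np.linalg.inv(K) @ np.array([u[0], u[1], 1.0])  # pi^{-1}(u, 1)
    rotated = R_k.T @ bearing                                  # apply R_k^{-1}
    projected = K @ rotated
    return projected[:2] / projected[2]                        # sub-pixel location
```

The returned sub-pixel location would then be splatted into the event frame with the bilinear drawing of Appendix A.2.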

To finalize the event frame, we apply the logarithm δI(u) ← log(δI(u) + 1) and normalize the image. Figure 5.2 shows a comparison between unrotated event frames

$$\delta I(x) = C \sum_k \pm_k\, \delta(x - u_k) \qquad (5.16)$$

on the left and the derotated frames from (5.15) on the right.


(a) Before rotation (b) After rotation

Figure 5.2: Sample event frames for the mapping pipeline. (a) Image of the raw events. (b) The event stream after rotation compensation. Only the translational displacement is left, which simplifies the epipolar line matching.


5.5 Keypoints

For performance reasons, we do not run a depth filter on every image point. Instead, we downsample the image and select those points with a minimum weight, defined by the intensity of the aggregated events.

We define a regular grid in the image plane. Each cell has a size of s × s pixels. For each cell, we compute the total weight and the mass centroid. Each cell with a weight higher than a given threshold is then turned into a keypoint defined by its mass centroid. This process is very similar to the VoxelGrid filter found in the Point Cloud Library, but using a 2D instead of a 3D grid. An example of the keypoint selection is given in Fig. 5.3.
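A possible sketch of this grid-based selection (Python/NumPy; cell size and threshold values are illustrative) is:

```python
import numpy as np

def grid_keypoints(event_img, cell=8, min_weight=4.0):
    """Select one keypoint per s x s cell whose total event weight exceeds
    a threshold; the keypoint is placed at the cell's mass centroid."""
    h, w = event_img.shape
    keypoints = []
    for y0 in range(0, h - cell + 1, cell):
        for x0 in range(0, w - cell + 1, cell):
            patch = event_img[y0:y0 + cell, x0:x0 + cell]
            weight = patch.sum()
            if weight > min_weight:
                ys, xs = np.mgrid[y0:y0 + cell, x0:x0 + cell]
                cx = (patch * xs).sum() / weight   # mass centroid, x
                cy = (patch * ys).sum() / weight   # mass centroid, y
                keypoints.append((cx, cy))
    return np.array(keypoints)
```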


Figure 5.3: Visualization of the keypoint selection. (a) The original absolute gradient image. (b) A regular grid is placed on the image; cells exceeding a minimum intensity value get a keypoint assigned at their center of mass. (c) The resulting keypoints on the original image.


Chapter 6

Image Reconstruction

Mosaicing is the process of reconstructing the original image seen by the camera using only the event stream. Using the integrated equation of continuity (see (2.7))

$$\delta I = -g \cdot \delta u \qquad (6.1)$$

we want to obtain a measurement of the image δI and the displacement δu in order to compute an estimate for g.

As a reminder, the original event frame in the camera frame c is given simply by (2.8):

$$\delta I_c(x) = C \sum_k \pm_k\, \delta(x - u_k). \qquad (6.2)$$

However, we now want to compute δI in a fixed reference frame r instead of the moving camera frame c. This means we have to reproject every event from c to r and then compute the displacement δu_r also in r.

To do the reprojection, for each event k we need the camera pose T_k relative to the reference frame as well as the associated depth d_{u_k} in c. Using these quantities, the delta image δI in the reference frame is given as

$$\delta I_r(x) = C \sum_k \pm_k\, \delta\!\left[x - \pi\!\big(T_k^{-1}\cdot \pi^{-1}(u_k, d_{u_k})\big)\right]. \qquad (6.3)$$

The displacement δu_r for a point u in the reference frame can then be computed as

$$\delta u_r = u_1 - u_0 \qquad (6.4)$$

$$u_i = \pi\big(T_i \cdot \pi^{-1}(u, d_u)\big), \quad i \in \{0, 1\} \qquad (6.5)$$

where T_0 and T_1 are the camera poses at the beginning and the end of the frame, respectively. We initialize the covariance matrix with a high uncertainty value P_0 as

$$P_{init} = \begin{pmatrix} P_0 & 0 \\ 0 & P_0 \end{pmatrix} \qquad (6.6)$$


6.1 Pixel-wise Extended Kalman Filter

Inspired by [11], we attempt to reconstruct the gradient g by using a pixel-wise Extended Kalman filter. Thus, we associate every pixel u in the reference frame with an estimated gradient g and a covariance matrix P.

Each point in the image δI corresponds to a measurement z, whereas the measurement model h is given by g · δu.

Thus, we have

$$\nu = z - h = \delta I - g \cdot \delta u \qquad (6.7)$$

$$\frac{\partial h}{\partial g} = \delta u^T \qquad (6.8)$$

This gives us the equations for the Kalman gain W and the innovation covariance S as

$$S = \frac{\partial h}{\partial g}\, P \left(\frac{\partial h}{\partial g}\right)^{\!T} + R = \delta u^T P\, \delta u + R, \qquad (6.9)$$

$$W = P \left(\frac{\partial h}{\partial g}\right)^{\!T} S^{-1} = \frac{P\, \delta u}{S}. \qquad (6.10)$$

The measurement noise R can be chosen freely and defines how quickly the filter adapts to new measurements, i.e., how much we trust a new measurement. This then leads to the following EKF update equations:

$$g \leftarrow g + W \nu \qquad (6.11)$$

$$P \leftarrow P - W S W^T \qquad (6.12)$$
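Since the state is only a 2-vector per pixel, the update reduces to a few scalar operations. A Python/NumPy sketch for a single pixel (names and the value of R are illustrative) is:

```python
import numpy as np

def ekf_pixel_update(g, P, delta_I, delta_u, R=0.1):
    """Per-pixel EKF update of the gradient estimate (Eqs. 6.7-6.12).
    g: 2-vector gradient estimate, P: 2x2 covariance, delta_I: scalar
    measurement, delta_u: 2-vector displacement, R: measurement noise."""
    nu = delta_I - g @ delta_u            # innovation, Eq. (6.7)
    S = delta_u @ P @ delta_u + R         # innovation covariance, Eq. (6.9)
    W = (P @ delta_u) / S                 # Kalman gain, Eq. (6.10)
    g_new = g + W * nu                    # state update, Eq. (6.11)
    P_new = P - S * np.outer(W, W)        # covariance update, Eq. (6.12)
    return g_new, P_new
```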

6.2 Poisson Image Reconstruction

Once the gradients g in the reference image have been measured, we can reconstruct the absolute image intensity by solving the Poisson equation [28]. We try to find the 2D function Î whose gradient is close to the measured g in a least-squares sense. This is described by minimizing the integral

$$\iint F(\nabla \hat I, g)\, dx\, dy \qquad (6.13)$$

$$F := \|\nabla \hat I - g\|^2 = \left(\frac{\partial \hat I}{\partial x} - g_x\right)^{\!2} + \left(\frac{\partial \hat I}{\partial y} - g_y\right)^{\!2} \qquad (6.14)$$

The optimal Î must satisfy the Euler-Lagrange equation given by

$$\frac{\partial F}{\partial \hat I} - \frac{d}{dx}\frac{\partial F}{\partial \hat I_x} - \frac{d}{dy}\frac{\partial F}{\partial \hat I_y} = 0 \qquad (6.15)$$

$$\Leftrightarrow \quad 2\left(\frac{\partial^2 \hat I}{\partial x^2} - \frac{\partial g_x}{\partial x}\right) + 2\left(\frac{\partial^2 \hat I}{\partial y^2} - \frac{\partial g_y}{\partial y}\right) = 0 \qquad (6.16)$$


Using the Laplace operator ∆ and the divergence of the vector field g

$$\Delta \hat I = \frac{\partial^2 \hat I}{\partial x^2} + \frac{\partial^2 \hat I}{\partial y^2} \qquad (6.17)$$

$$\operatorname{div} g = \frac{\partial g_x}{\partial x} + \frac{\partial g_y}{\partial y} \qquad (6.18)$$

we arrive at the Poisson equation

$$\Delta \hat I = \operatorname{div} g \qquad (6.19)$$

The Poisson equation in 2D with Dirichlet boundary conditions

$$\hat I\,\big|_{\partial\Omega} = \text{const.} \qquad (6.20)$$

can be solved efficiently using a sine-transform-based method, which has been implemented by Tino Kluge.1
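For illustration, the same equation can also be solved with a simple (much slower) Jacobi iteration. The following Python/NumPy sketch is not the sine-transform solver used here, but it shows the discrete problem being solved, with a zero Dirichlet boundary:

```python
import numpy as np

def poisson_reconstruct(gx, gy, n_iter=2000):
    """Recover an intensity image from its gradient field by iteratively
    solving the Poisson equation (6.19) with zero boundary values."""
    # divergence of g with backward differences, Eq. (6.18)
    div = np.zeros_like(gx)
    div[:, 1:] += gx[:, 1:] - gx[:, :-1]
    div[1:, :] += gy[1:, :] - gy[:-1, :]

    I = np.zeros_like(gx)
    for _ in range(n_iter):
        # Jacobi sweep of the 5-point Laplacian: average of the four
        # neighbours minus a quarter of the divergence
        I[1:-1, 1:-1] = 0.25 * (I[1:-1, :-2] + I[1:-1, 2:] +
                                I[:-2, 1:-1] + I[2:, 1:-1] - div[1:-1, 1:-1])
    return I
```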

We sketch the process of Poisson image reconstruction in Fig. 6.1. Figure 6.1a shows the original intensity image. Figures 6.1b and 6.1c show the two components of the gradient, g_x and g_y, respectively. These are the two quantities estimated by the EKF from Section 6.1. From there, we compute the divergence div g, shown in Fig. 6.1d.

Using the divergence image, the solution to the Poisson equation is then a reconstruction of the original intensity image, shown in Fig. 6.1e. Except for some blur, it is almost indistinguishable from the original image. Fig. 6.1f shows the residual |Î − I| between the reconstructed image and the original image. One can, however, notice the constant boundary intensity, which is a result of the Dirichlet boundary condition we assumed earlier.

1http://kluge.in-chemnitz.de/opensource/poisson_pde/


Figure 6.1: Steps of the Poisson image reconstruction. (a) The original intensity image I. (b) and (c) Gradient ∇I = (g_x, g_y)^T of I in the x and y directions. (d) The divergence of g. (e) The solution Î to the Poisson equation ∆Î = div g. (f) The absolute difference between the reconstruction Î and the original image I.


6.3 Depthmap Propagation

We assume the depthmap to be given in the reference frame. However, for the reprojection of the events in equation (6.3), we also need the depth values d_u in the camera frame c.

Unfortunately, it is not trivial to efficiently use a depthmap in different camera frames. The most precise solution is to raytrace each pixel in the new camera frame and find the closest 3D point in the given map. This is, however, rather complex to implement and compute, so we opt for a simpler approach.

After processing a fixed number of events, we reproject the 3D points from the reference frame and compute a new depthmap for the current camera frame. This has the advantage that every depth lookup has only O(1) complexity and does not require any kind of nearest-neighbor matching.

We also “densify” the depthmap, as it is only given in the form of a sparse set of 3D points. To do this, we apply a blurring technique similar to the one in Section 4.5. The projection uses two images, the accumulated depth values D(u) and the weights W(u). To blur the projection of each point, we take a Gaussian kernel K of size k × k, where k defines the blur. For each projected map point, we update the region around that particular point with our Gaussian kernel. In D we add zK, whereas in W we only add K. Having the accumulated depth measurements and the weights in two separate matrices then gives the true values by computing

$$d_u = \frac{D(u)}{W(u)}. \qquad (6.21)$$

The resulting depthmap is then a mixture between Gaussian blur and linear interpolation through the weights. In Fig. 6.2 we show the same projected depthmap with different blur values. As one can see, a larger blur results in more available depth values in the depthmap, but also loses some of the fine structure of the scene.
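A sketch of this densification (Python/NumPy, assuming an odd kernel size k; all parameter values are illustrative) is:

```python
import numpy as np

def densify_depthmap(points_2d, depths, shape, k=7, sigma=2.0):
    """Splat sparse depth values with a Gaussian kernel and normalize by
    the accumulated weights, Eq. (6.21)."""
    ax = np.arange(k) - k // 2
    K = np.exp(-(ax[None, :]**2 + ax[:, None]**2) / (2 * sigma**2))  # k x k kernel
    D = np.zeros(shape)          # accumulated weighted depths
    W = np.zeros(shape)          # accumulated weights
    r = k // 2
    h, w = shape
    for (x, y), z in zip(points_2d, depths):
        x, y = int(round(x)), int(round(y))
        if r <= y < h - r and r <= x < w - r:   # skip points near the border
            D[y - r:y + r + 1, x - r:x + r + 1] += z * K
            W[y - r:y + r + 1, x - r:x + r + 1] += K
    depthmap = np.zeros(shape)
    valid = W > 0
    depthmap[valid] = D[valid] / W[valid]
    return depthmap
```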


Figure 6.2: Different values of the kernel size k used to blur the map. The values from left to right are 3, 7, 15, and 25 pixels.


Chapter 7

Parallel Tracking, Depth Estimation, and Image Reconstruction

In order to do visual odometry with an event camera, all we have to do is fuse the three processes of tracking, depth estimation, and image reconstruction. The outline of the procedure is as follows:

1. Tracking: Create a 3D map from a given intensity image and depthmap, then track the poses of a new set of events until a distance d_kf is reached.

2. Mapping: Given a set of events for mapping together with the camera poses, run

(a) Depth Estimation using the events and poses

(b) Image Reconstruction using the events, poses, and depthmap

3. Start again with Tracking

Here, d_kf denotes the keyframe distance, where a keyframe consists of a depthmap and the reconstructed gradients. We visualize the process in Fig. 7.1.


Figure 7.1: Schema of the visual odometry loop.


The algorithm is initialized by taking a set of poses from the ground truth trajectory. Since the tracking part only depends on the gradient and not on the actual intensity image, we can skip the Poisson integration during the whole process. However, we always reconstruct the intensity image to have an appealing visualization and to easily judge the quality of the image reconstruction.

Another thing to note is that the depth estimation and image reconstruction process the events backwards in time. Given the set of N_map events, the reference frame is defined by the timestamp of the last event. Then, all previous events are incorporated incrementally. This is needed because it is desirable to compute a new depthmap and gradient image at the most recent point in time, which is clearly the last event from the tracking part.


Chapter 8

Experiments

In this chapter, we describe the experiments we ran to assess the quality of the three different parts as well as the full visual odometry pipeline. We use both synthetic and real datasets wherever feasible. Synthetic datasets have perfect ground truth in both depth and pose, which allows us to intensively test the performance of each component quantitatively. Real datasets have mostly been evaluated only qualitatively due to the lack of good ground truth measurements.

8.1 Tracking

We test the tracking method on three synthetic datasets consisting of textured 3D environments with camera motion in all six degrees of freedom.

The datasets are called Walls, Planes, and Office. Figures 8.1, 8.2, and 8.3 show the three scenarios, each with two different views of the 3D scene and the camera trajectory, the reference image from the beginning of the trajectory, and the ground truth depthmap.

In Table 8.1 we list properties such as the number of events, duration, and average scene depth for each of the three datasets.

Dataset    Events [M]   Duration [s]   Avg. Depth [m]   Traj. Length [m]
Walls      1.8          2.0            6.8              5.9
Planes     0.9          1.2            2.2              2.5
Office     1.6          1.0            19               7.7

Table 8.1: Properties of the three synthetic sequences used to quantitatively assess the quality of the tracking method.


Figure 8.1: Visualization of the Walls sequence. (a) Trajectory and 3D map. (b) Reference image. (c) Depthmap (color scale: close to far).


Figure 8.2: Visualization of the Planes sequence. (a) Trajectory and 3D map. (b) Reference image. (c) Depthmap.


Figure 8.3: Visualization of the Office sequence. (a) Trajectory and 3D map. (b) Reference image. (c) Depthmap (color scale: close to far).


              ∆t_rel [%]               ∆θ [deg]
Dataset       µ    σ    max   RMSE     µ    σ    max   RMSE
Walls         1.3  0.5   2.5   1.4     0.7  0.2   1.0   0.7
Planes        2.2  0.4   3.0   2.2     0.2  0.1   0.5   0.2
Office        1.6  1.2   4.9   2.0     1.1  0.8   3.4   1.4
Walls (fast)  2.8  1.6   6.7   3.2     1.4  0.8   3.8   1.6

Table 8.2: Error evaluation of the tracking method based on the three synthetic datasets.

8.1.1 Precision

In Figs. 8.4, 8.5, and 8.6 we show the tracked trajectories of Walls, Planes, and Office, respectively. We display the shape together with the ground truth in a 3D visualization on the left. The errors with respect to ground truth are given on the right for both translation and rotation. We use the relative translation to make the result independent of the scale. Thus, the translational error is given as

$$\Delta t_{rel} := \frac{\|t - t_{gt}\|}{d}, \qquad (8.1)$$

where t is the estimated translation, t_gt the ground truth, and d the average scene depth.

For the rotation error, we use the angle between the measured orientation quaternion q and the ground truth quaternion q_gt, which is given as

$$\Delta\theta := \cos^{-1}\!\big(2\,\langle q, q_{gt}\rangle^2 - 1\big), \qquad (8.2)$$

where the inner product of two quaternions is given as

$$\langle p, q\rangle = p_0 q_0 + p_1 q_1 + p_2 q_2 + p_3 q_3. \qquad (8.3)$$
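Both metrics are straightforward to compute; an illustrative Python/NumPy sketch is:

```python
import numpy as np

def translation_error(t, t_gt, avg_depth):
    """Relative translational error, Eq. (8.1)."""
    return np.linalg.norm(t - t_gt) / avg_depth

def rotation_error(q, q_gt):
    """Angular error between two unit quaternions, Eqs. (8.2)-(8.3),
    returned in radians."""
    dot = float(np.dot(q, q_gt))                       # quaternion inner product
    return np.arccos(np.clip(2.0 * dot**2 - 1.0, -1.0, 1.0))
```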


Figure 8.4: Quantitative trajectory evaluation of the Walls sequence. (a) The estimated translation in 3D compared to the ground truth trajectory. (b) Plot of the translational error ∆t_rel and rotational error ∆θ.


Figure 8.5: Quantitative trajectory evaluation of the Planes sequence. (a) 3D plot of the translation. (b) Error plot of relative translation and rotation.


Figure 8.6: Quantitative trajectory evaluation of the Office sequence. (a) 3D plot of the translation. (b) Error plot of relative translation and rotation.


Dataset       Processing time [ms]   Events per sec. [M]   Poses per sec.
Walls         477                    3.3                   633
Planes        317                    2.9                   463
Office        554                    2.8                   837
Walls (fast)  90                     204                   11427

Table 8.3: Performance evaluation of the tracking method.

8.1.2 Performance

The speed of our method has been evaluated on the synthetic datasets; the results are listed in Table 8.3. In general, we use 500 samples with 30 iterations and an event-map ratio of 0.4 for the random sparse tracking. With these settings, we achieve a rate of about 500 poses or frames per second, which corresponds to 2–3 million events per second. If we trade some accuracy for speed by using only 50 samples and 5 iterations, while decreasing the event-map ratio to 15 %, we can still maintain reasonable tracking quality, and the speed can go up to 200 million events per second, corresponding to more than 10,000 poses or frames. This by far outperforms any available visual tracking method. The trajectory for tracking the Walls sequence using the fast setting is shown in Fig. 8.7.

8.1.3 Real World Setting

To test the method with real data, we make use of the image frames provided by the DAVIS camera and show an augmented reality demo, where the logo of the RPG is placed inside the camera frame. This allows a qualitative evaluation of the tracking accuracy.

As no depthmap is available, we assume the world to be a flat surface at some given distance.

Figure 8.8 shows four consecutive camera frames captured from the DAVIS sensor at 24 Hz while rotating the camera quickly by about 90°. The left column shows the RPG logo superimposed on a picture of some leaves. The right column shows the corresponding event image (green) overlaid with the current projection of the reference map (blue). The position of the logo in the camera frame and the alignment of the events with the projected map demonstrate that tracking is still possible even during high-speed motion.

In the two middle frames there is strong motion blur visible, which would likely break conventional visual localization methods, thus emphasizing the power of our approach.

For better visualization of the process, we also provide a video1 [16] demonstrating the method on real data.

1https://youtu.be/MUt6tE78634


Figure 8.7: Quantitative trajectory evaluation of the Walls sequence using the fast setting. (a) 3D plot of the translation. (b) Error plot of relative translation and rotation.


Figure 8.8: Augmented reality demo on the real-world Leaves dataset. The figure shows four consecutive frames while rotating the camera by about 90°. (a) The augmented reality view, where the RPG logo is placed at a fixed position in the world frame. (b) The event stream (green) overlaid on the current projection of the 3D map (blue).


8.2 Depth Estimation

We test the depth estimation on the Walls, Dunes, and Boxes datasets. These are well suited as they contain a good amount of linear motion, which is necessary for our depth estimation method to work well.

The Boxes dataset was captured with the real DAVIS sensor, which was put on a linear slider in front of two textured boxes. The boxes were placed at an angle of 90° to each other. The scene is shown in Fig. 8.13. However, as the picture was taken at a later time, the boxes are not set up exactly at the 90° position.

The resulting depthmaps are shown in Figs. 8.9–8.11. For the first two datasets, where ground truth is available, we compare the measured depth values with the correct ones. As the last dataset does not have ground truth depth, we show a top-down view of the resulting point cloud instead. Figure 8.11 clearly shows the 90° angle between the two visible walls of the boxes.

For further verification of the quality, we feed the computed depthmap into our image reconstruction pipeline and compare the resulting intensity image with the actual camera image. These are shown in Fig. 8.12.

Figure 8.9: Depth estimation results for the Walls sequence. (a) Estimated depth. (b) Ground truth. (Color scales: depth from close to far; relative error from 0 to 7 %.)


Figure 8.10: Depth estimation results for the Dunes sequence. (a) Estimated depth. (b) Ground truth. (Color scales: depth from close to far; relative error from 0 to 4.5 %.)

Figure 8.11: Depth estimation results for the Boxes sequence. (a) Estimated depth. (b) Top-down view of the resulting point cloud, shown due to the lack of ground truth. The boxes were placed at an angle of 90° to each other.


Figure 8.12: Image reconstructions using the estimated depth for the Walls, Dunes, and Boxes sequences. (a) Image reconstructions. (b) Reference images.

Figure 8.13: Picture of the scene for the Boxes sequence. The photo was taken at a later time, so the boxes are not set up at 90°.


8.3 Reconstruction

We apply the reconstruction to the synthetic Walls and Planes datasets. The provided depthmap is dense, which slows down the map projection significantly, but it still serves well for evaluating the quality of the process. A sequence of reconstructions is shown for both datasets in Figs. 8.14 and 8.15.

In the Planes sequence (Fig. 8.14), the reconstruction is almost indistinguishable from the ground truth in Fig. 8.2. The circular movement of the camera provides good measurements in the x and y directions, which are needed to reconstruct the full gradient g. The edges of the planes, however, appear blurred, as our method currently does not account for occlusions.

The Walls sequence in Fig. 8.15 demonstrates nicely how the trajectory of the camera influences the quality of the reconstruction. The gray spot on the right is due to little apparent motion in that area, which does not generate sufficient measurements for the image reconstruction. A demonstration of the reconstruction process can also be seen on YouTube2 [16].

Figure 8.14: Image reconstructions for the Planes sequence. Images go row-wise from top-left to bottom-right, showing 2 000, 10 000, 100 000, and 400 000 events.

2https://youtu.be/0kqritbvVnA


Figure 8.15: The same for the Walls sequence, showing 2 000, 10 000, 400 000, and 1 800 000 events.


8.4 Visual Odometry

We tested the integrated odometry pipeline on the Dunes and Desk datasets, which have a simple linear trajectory and are thus well suited for an evaluation of our approach. The trajectories of both datasets are displayed in Figs. 8.16 and 8.17 and show only little deviation. To show the capabilities of our approach, we compare the depthmap and the image reconstruction with the ground truth depthmap and the actual camera image. The results are shown in Figs. 8.18 and 8.19.

This is, to the best of our knowledge, the first algorithm capable of simultaneously producing the 3D structure and a reconstruction of the intensity image while handling translational motion. However, a major problem of the current implementation is that it does not build a consistent map across keyframes, which leads to significant drift over time.

Tracking is done with respect to the current keyframe. So, whenever the keyframe changes, there is a jump in the trajectory. Thus, for good mapping results, we can only use the events up to the last keyframe. To increase the number of events for mapping, i.e. to increase the baseline, it is desirable to increase the keyframe distance. However, the tracking quality decreases as the keyframe distance gets bigger.


Figure 8.16: Evaluation of the trajectory from the visual odometry pipeline on the Dunes sequence. (a) The estimated translation in 3D compared to the ground truth trajectory. (b) Plot of the translational error ∆t_rel and rotational error ∆θ.


Figure 8.17: Evaluation of the trajectory from the visual odometry pipeline on the Desk sequence. (a) 3D plot of the translation. (b) Error plot of relative translation and rotation.


Figure 8.18: Keyframes for visual odometry on Dunes. (a) Estimated depth and image reconstruction. (b) Ground truth depth and reference image.


Figure 8.19: Keyframes for visual odometry on Desk. No ground truth depthmap is available for this real dataset. (a) Estimated depth and image reconstruction. (b) Reference image. (Depth color scale: 0.45 m to 2.45 m.)


Chapter 9

Discussion

9.1 Tracking

We tested the tracking method on a varied set of both synthetic and real datasets. The method is able to localize in many settings with high precision and very high speed. Using the random sparse tracking method, we can achieve a frame rate of more than 10,000 frames per second and about 6 million events per second. This is, so far, unmatched by any existing visual odometry pipeline.

As a result, we have a very low latency of theoretically less than 1 ms in the tracking process. Tracking latency is today a severe problem in many localization applications, such as mobile robot navigation, but also in the entertainment industry for projects like the Oculus Rift virtual reality headset or Google's Project Tango. Both rely on fast tracking to provide an appealing user experience. So far, a tracking frequency in pose estimation as high as 1,000 Hz can only be achieved by interpolating the poses between camera frames using an IMU. As of today, there exists no solution that achieves the same speed purely vision-based.

However, the algorithm still does not perform well in scenes with very fine structure. Textured surfaces without strong edges pose problems for the algorithm. This is likely due to the internal representation of the map. Treating gradients or edges as binary information ignores a significant part of the actual image. The binary threshold on the input image works well if there are clearly distinguishable edges visible in the scene. If the texture is smoother or contains very detailed structure, there will be either too many or no edges at all. It is thus difficult to find a good threshold parameter for the gradient extraction.

9.1.1 Event Frame

In our method we try to minimize the number of events per frame. Usually a frame contains about 1,000 to 5,000 events. As a result, there is usually no more than a single event per pixel in the new image. Thus, we have no intensity information associated directly with a pixel, which is why we treat the event frame as a binary image.


We suggest two possible improvements for the method:

1. Increase the sensor resolution: A higher resolution makes even fine structure visible, resulting in better visibility of the edges.

2. Include intensity information: For the map representation this is an easy task: since the map is extracted by thresholding the gradient magnitude of the input image, we could add the actual gradient value to each map point. For the new event image, on the other hand, this is not trivial. One possibility would be to increase the number of events per frame. To maintain a high frame rate, these frames could be overlapping, resulting in a sliding-window approach. Accumulating too many events, however, also results in more motion blur and thus a less accurate position. Another possibility would be to use the time since the last event fired in each pixel as the instantaneous event rate. This has already been tested in [11] and could lead to some interesting results.

9.1.2 IMU Integration

Using the measurements from the IMU to derive a prior for the pose could lead to an elegant integration into the tracking method. At the moment, all tracking works on the gradient magnitude as well as the events. The reason for this is that the sign of the gradient depends on the actual motion of the point, and having no prior on the motion, we are unable to predict the sign of the gradient.

Given a prior for the new position, we can predict the motion field using the depth for the whole map. Multiplying the motion field with the map gradients then leads to a prediction of both the sign and intensity of the gradient measured by the moving DVS.

9.2 Depth Estimation

Although the results of the depth estimation look very promising, it is the part of the visual odometry pipeline that has the strongest restrictions on general usability. At the time of development, it was the only method to provide accurate depth measurements using a single event camera without additional sensing. Other approaches, for example, make use of the camera images provided by the DAVIS.

However, to give good results, two conditions have to be satisfied:

1. Keypoint visibility: All relevant points must be visible in the first frame. As keypoint extraction takes place in the first frame only, no depth measurements will be available for parts of the scene that fired no events in the beginning. This poses a severe constraint, as the quality of the reference frame greatly depends on the motion during the frame integration.

2. Sufficient parallel motion: Gradients are obtained from the reference frame and compared with all subsequent ones. Thus, the method fails as soon as the motion becomes perpendicular to the motion at the beginning, as the gradient in this direction has not been measured. To obtain a good depth estimate, it is therefore necessary to have sufficient motion in the same direction as the motion of the reference frame. This is of course the case when the motion is linear.


To draw a conclusion, the depth estimation gives a good indication of how well we can measure depth with event cameras. However, due to its constraints on the motion, it is unlikely to be useful in a generic setting such as a visual odometry pipeline with full 6-DoF rigid-body motion.

In general, an event-based depth estimation method should be independent of a reference image built from the events, since there is no guarantee that an event camera will have all parts of the scene visible within a short amount of time.

9.3 Image Reconstruction

In contrast to the tracking and the depth estimation, the image reconstruction is mostly based on previous work [11], where a 2D spherical panorama is reconstructed while tracking the rotation of the camera at the same time. As the motion is restricted to rotation only, no depth information is needed to project the events from the camera frame into a global map frame. Here, we extended the technique by including a depth measurement, which allows us to transform the location of any event into the map frame for arbitrary 6-DoF motions.

This shows that solving the SLAM problem also turns an event camera into a traditional image sensor, since the camera pose and depth can be used to reconstruct first the gradients in the map frame and, from there, the original image intensity that gave rise to the observed events. This is very interesting, as event-based vision has far superior dynamic range compared to current standard cameras, thus allowing high dynamic range (HDR) SLAM. It also does not suffer from motion blur in the same way a frame-based camera does.


Appendix A

Appendix

A.1 Sub-Pixel Bilinear Interpolation

Often we want to extract an intensity value from an image at a sub-pixel location. For example, the Lucas-Kanade method uses sub-pixel interpolation to increase precision.

The fastest method to compute the intensity value I at an arbitrary location u = (u_x, u_y)^T is zero-order, i.e. nearest-neighbor, interpolation, which is given as

$$I(u) = I(\lfloor u_x \rceil, \lfloor u_y \rceil) \qquad (A.1)$$

where ⌊·⌉ denotes rounding to the nearest integer.

The first-order method is also called bilinear interpolation. It uses the four neighboring pixel values I_k of u, where k ∈ {1, 2, 3, 4}. If we denote the fractions δ_i := u_i − ⌊u_i⌋, i ∈ {x, y}, as the weights between pixel values along the x and y axes, the bilinear interpolation is calculated as a convex combination of all four neighboring pixels:

$$I(u) = I_1 (1-\delta_x)(1-\delta_y) + I_2\, \delta_x (1-\delta_y) + I_3 (1-\delta_x)\, \delta_y + I_4\, \delta_x \delta_y. \qquad (A.2)$$

It is easy to verify that if δ_x or δ_y is either zero or one, the interpolated intensity value reduces to one of the corresponding pixel values.
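A direct implementation of Eq. (A.2) in Python/NumPy is shown below; the ordering of the neighbors I_1 to I_4 is an assumption and is made explicit in the comments:

```python
import numpy as np

def bilinear_interpolate(img, u):
    """Bilinear interpolation of `img` at sub-pixel location u = (x, y),
    following Eq. (A.2). Assumes u lies inside the image."""
    x0, y0 = int(np.floor(u[0])), int(np.floor(u[1]))
    dx, dy = u[0] - x0, u[1] - y0
    I1, I2 = img[y0, x0], img[y0, x0 + 1]          # top-left, top-right
    I3, I4 = img[y0 + 1, x0], img[y0 + 1, x0 + 1]  # bottom-left, bottom-right
    return (I1 * (1 - dx) * (1 - dy) + I2 * dx * (1 - dy) +
            I3 * (1 - dx) * dy + I4 * dx * dy)
```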

A.2 Sub-Pixel Bilinear Drawing

When transforming events into a virtual image plane (due to reprojection or rotation compensation), the resulting event position is generally not an integer coordinate. To improve the accuracy of the warping at the pixel level, instead of rounding the resulting position to the nearest integer location, we use a technique which is in some sense the inverse of bilinear interpolation. While bilinear interpolation extracts a value from an image at a sub-pixel location, we now want to draw a given intensity value at a sub-pixel location. The drawing affects the four neighboring pixels in such a way that the sum of all contributions equals the value we want to draw.


Calling the four neighboring pixel values I_k and the fractions of the image coordinates δ_i, we draw the value v on the image by adding to the current pixel values I_k the contributions of splitting v according to the sub-pixel distances from the resulting location to the four neighbors:

$$I_1 \leftarrow I_1 + v\,(1-\delta_x)(1-\delta_y), \qquad (A.3)$$

$$I_2 \leftarrow I_2 + v\,\delta_x (1-\delta_y), \qquad (A.4)$$

$$I_3 \leftarrow I_3 + v\,(1-\delta_x)\,\delta_y, \qquad (A.5)$$

$$I_4 \leftarrow I_4 + v\,\delta_x \delta_y. \qquad (A.6)$$

It is easy to verify that the sum of the fractional-distance weights is one:

$$(1-\delta_x)(1-\delta_y) + \delta_x(1-\delta_y) + (1-\delta_x)\delta_y + \delta_x\delta_y = 1. \qquad (A.7)$$
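The corresponding drawing operation, Eqs. (A.3)–(A.6), can be sketched in the same style (again with an assumed neighbor ordering):

```python
import numpy as np

def bilinear_draw(img, u, v=1.0):
    """Splat the value v at sub-pixel location u = (x, y) onto `img`,
    distributing it over the four neighbors as in Eqs. (A.3)-(A.6).
    Modifies `img` in place; assumes u lies inside the image."""
    x0, y0 = int(np.floor(u[0])), int(np.floor(u[1]))
    dx, dy = u[0] - x0, u[1] - y0
    img[y0,     x0    ] += v * (1 - dx) * (1 - dy)
    img[y0,     x0 + 1] += v * dx * (1 - dy)
    img[y0 + 1, x0    ] += v * (1 - dx) * dy
    img[y0 + 1, x0 + 1] += v * dx * dy
```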


Bibliography

[1] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE J. of Solid-State Circuits, 43(2):566–576, 2008.

[2] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck. A 240×180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE J. of Solid-State Circuits, 49(10):2333–2341, 2014.

[3] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), volume 1, pages 652–659, June 2004.

[4] M. Cook, L. Gugelmann, F. Jug, C. Krautz, and A. Steger. Interacting maps for fast visual interpretation. In Int. Joint Conf. on Neural Networks (IJCNN), pages 770–776, 2011.

[5] D. Weikersdorfer and J. Conradt. Event-based particle filtering for robot self-localization. In IEEE Int. Conf. on Robotics and Biomimetics (ROBIO), 2012.

[6] D. Weikersdorfer, D. B. Adrian, D. Cremers, and J. Conradt. Event-based 3D SLAM with a depth-augmented dynamic vision sensor. In IEEE Int. Conf. on Robotics and Automation (ICRA), June 2014.

[7] E. Mueggler, B. Huber, and D. Scaramuzza. Event-based, 6-DOF pose tracking for high-speed maneuvers. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2014.

[8] E. Mueggler, G. Gallego, and D. Scaramuzza. Continuous-time trajectory estimation for event-based vision sensors. In Robotics: Science and Systems (RSS), 2015.

[9] A. Censi and D. Scaramuzza. Low-latency event-based visual odometry. In IEEE Int. Conf. on Robotics and Automation (ICRA), 2014.

[10] G. Gallego, C. Forster, E. Mueggler, and D. Scaramuzza. Event-based camera pose tracking using a generative event model. arXiv:1510.01972, 2015.

[11] H. Kim, A. Handa, R. Benosman, S.-H. Ieng, and A. J. Davison. Simultaneous mosaicing and tracking with an event camera. In British Machine Vision Conf. (BMVC), 2014.

[12] P. Bardow, A. J. Davison, and S. Leutenegger. Simultaneous optical flow and intensity estimation from an event camera. In IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), 2016.

[13] S. Schraml, A. N. Belbachir, and H. Bischof. Event-driven stereo matching for real-time 3D panoramic vision. In IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), pages 466–474, June 2015.

[14] B. Kung. Visual odometry pipeline for the DAVIS camera. Master's thesis, University of Zurich, 2016.

[15] R. G. Valenti, I. Dryanovski, and J. Xiao. Keeping a good attitude: A quaternion-based orientation filter for IMUs and MARGs. Sensors, 15(8):19302–30, 2015.

[16] T. Horstschäfer. Supplementary video material for the thesis. https://www.youtube.com/channel/UCJYtMZQZlAzeT98vNrP1j-g.

[17] T. Delbruck, V. Villanueva, and L. Longinotti. Integration of dynamic vision sensor with inertial measurement unit for electronically stabilized event-based vision. Int. Conf. on Circuits and Systems (ISCAS), pages 2636–2639, 2014.

[18] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In Eur. Conf. on Computer Vision (ECCV), 2014.

[19] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. Int. J. Comput. Vis., 56(3):221–255, 2004.

[20] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2003.

[21] Y. Ma, S. Soatto, J. Kosecka, and S. Shankar Sastry. An Invitation to 3-D Vision: From Images to Geometric Models. Springer, 2004.

[22] P. Corke. Robotics, Vision and Control: Fundamental Algorithms in MATLAB. Springer Tracts in Advanced Robotics. Springer, 2011.

[23] F. Chaumette and S. Hutchinson. Visual servoing and visual tracking. In B. Siciliano and O. Khatib, editors, Springer Handbook of Robotics. Springer, 2008.

[24] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In Int. Conf. on Computer Vision (ICCV), pages 2320–2327, November 2011.

[25] C. Forster, M. Pizzoli, and D. Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 15–22, 2014.

[26] G. Vogiatzis and C. Hernandez. Video-based, real-time multi-view stereo. Image Vision Comput., 29(7):434–441, 2011.

[27] M. Pizzoli, C. Forster, and D. Scaramuzza. REMODE: Probabilistic, monocular dense reconstruction in real time. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 2609–2616, 2014.

[28] R. Fattal, D. Lischinski, and M. Werman. Gradient domain high dynamic range compression. ACM Transactions on Graphics, 21:249–256, 2002.


Acknowledgements

I would like to thank Dr. Guillermo Gallego, Henri Rebecq, and Elias Mueggler for the fruitful discussions about the mathematics and methods involved in implementing a visual odometry pipeline for an event camera. Without their help, the thesis would not have reached its final state.
