


Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2016

Automatic Volume Estimation using Structure-from-Motion fused with a Cellphone’s Inertial Sensors

Marcus Fallqvist


LiTH-ISY-EX--17/5107--SE

Supervisor: Tommaso Piccini, [email protected], Linköpings universitet

Hannes Ovrén, [email protected], Linköpings universitet

Christer Andersson, [email protected], Escenda Engineering AB

Examiner: Klas Nordberg, [email protected], Linköpings universitet

Department of Electrical Engineering
Linköping University

SE-581 83 Linköping, Sweden

Copyright © 2016 Marcus Fallqvist

Sammanfattning

The report describes how the volume of large-scale objects, namely gravel and stone piles, can be determined in an outdoor environment using a mobile phone camera together with internal sensors such as the gyroscope and accelerometer. The project was commissioned by Escenda Engineering with the motivation of replacing more complex and resource-demanding systems with a simple handheld instrument. The implementation uses, among others, the common computer vision methods Kanade-Lucas-Tomasi point tracking, Structure-from-Motion and Space Carving, together with simpler sensor fusion. The report shows that volume estimation is possible, but that the accuracy is limited by sensor quality and a bias.


Abstract

The thesis work evaluates a method to estimate the volume of stone and gravel piles using only a cellphone to collect video and sensor data from the gyroscopes and accelerometers. The project is commissioned by Escenda Engineering with the motivation to replace more complex and resource-demanding systems with a cheaper and easy-to-use handheld device. The implementation features popular computer vision methods such as KLT tracking, Structure-from-Motion and Space Carving, together with some sensor fusion. The results imply that it is possible to estimate volumes with an accuracy that is limited by the sensor quality and a bias.


Acknowledgments

I would like to thank my examiner Klas Nordberg and my two supervisors at ISY, Tommaso Piccini and Hannes Ovrén. You all helped me get past several problems and always gave me constructive feedback. I would also like to acknowledge Christer Andersson at Escenda.

Linköping, April 2017
Marcus Fallqvist


Contents

Notation

1 Introduction
  1.1 Motivation
  1.2 Purpose
  1.3 Problem statements
  1.4 Delimitations

2 Theory
  2.1 Related Work
    2.1.1 Recognition and volume estimation of food intake using a mobile device
    2.1.2 Estimating volume and mass of citrus fruits by image processing technique
    2.1.3 Shape from focus
  2.2 Sensor Data and Sensor Fusion
    2.2.1 Gyroscope
    2.2.2 Accelerometer
    2.2.3 Bias Estimation
  2.3 Feature Point Tracker
  2.4 Pinhole Camera Model
    2.4.1 Intrinsic Calibration
    2.4.2 Epipolar Geometry
  2.5 Minimisation Methods
    2.5.1 RANSAC
    2.5.2 Improving RANSAC
  2.6 Structure-from-Motion
    2.6.1 Two-view reconstruction
    2.6.2 Perspective-n-point - adding a view
    2.6.3 Bundle Adjustment
  2.7 Volumetric Representation - Space Carving
    2.7.1 Object Silhouette and Projection
  2.8 Fusion of Camera and IMU Sensors & SfM Improvements
    2.8.1 Estimating IMU to Camera Transformation
    2.8.2 Global Positioning System Fused With SfM
    2.8.3 Scaling Factor Between CCS and WCS
    2.8.4 SfM Using a Hierarchical Cluster Tree

3 Method
  3.1 Sensors and Data Acquisition
  3.2 Tracking in Video Sequence
  3.3 OpenMVG SfM module
  3.4 Volumetric Representation - Space Carving
    3.4.1 Object Silhouette and Projection
    3.4.2 Removing Voxels Below the Groundplane
  3.5 Determining the Real Volume

4 Evaluation and Results
  4.1 Data Sets
  4.2 Sensor Fusion - Scale Estimation
    4.2.1 IMU Readings and Integration
    4.2.2 Result - Scale Factor
  4.3 Computer Vision - Volume Estimation
    4.3.1 Tracking module
    4.3.2 Structure-from-Motion
    4.3.3 Space Carving
    4.3.4 Result - Volume Estimation

5 Discussion
  5.1 Result
    5.1.1 Scale Estimation
    5.1.2 Space Carving
    5.1.3 Volume Estimation
  5.2 Method
  5.3 Future Work
    5.3.1 Rolling shutter calibration
  5.4 Wider Implications of the Project Work

6 Conclusions
  6.1 Volume Estimation
  6.2 Method Evaluation
  6.3 System Performance

Bibliography

Notation

Symbols

Notation    Meaning
R^N         The set of real numbers in N dimensions
M           A matrix
\vec{y}     A vector in 3D
y           C-normalised homogeneous 2D point

Abbreviations

Abbreviation    Meaning
CS              Coordinate system
SfM             Structure-from-Motion
IMU             Inertial measurement unit


1 Introduction

1.1 Motivation

Volume estimation of large-scale objects, such as stone and gravel piles at e.g. building sites and quarries, is today done with complex and time-consuming methods that also require many man hours and expensive equipment such as laser camera systems and drones. In this thesis another approach is suggested, where the idea is to use an easily accessible and relatively cheap instrument to determine the volume of such objects. The replacement equipment is a single modern handheld device, a tablet or a smartphone. The user only needs to gather data of the object of interest, without any prior knowledge of the techniques and algorithms running in the background, and is still presented with the volume estimate in the end.

1.2 Purpose

The objective of this thesis is to develop a system which can accomplish the volume estimation automatically by using a modern handheld device, and then to evaluate at what precision the volume estimation is made. The volume estimation is made using 3D reconstruction and inertial measurement unit (IMU) data, explained in section 2.2. These results are then compared with ground truth (GT) data. To calculate the volume of stone and gravel piles, 3D representations must be generated. In this thesis, Structure-from-Motion (SfM) representations of the scenes are generated, explained in section 2.6. The unknown scaling factor in these is determined with sensor fusion. This is done simultaneously with the video acquisition, i.e. the IMU recordings are logged whilst shooting the video of the object in the scene. The general workflow of the ideal system can be seen in figure 1.1.



1.3 Problem statements

The thesis answers the following questions:

• How can the volume of stone and gravel piles be estimated in an outdoor environment using a mobile phone and a computer?

• How can the method be evaluated?

• Is it possible to make the system usable in practice?

The answers to the formulated questions are presented in chapter 5, based on the results in chapter 4. The system is implemented in Windows with the methods described in chapter 3 and uses data recorded with a smartphone (Samsung S6). The results are compared to the true volume measured on site. The client Escenda Engineering AB considers the estimations viable for practical use if the error is < 10% of the real volume.

1.4 Delimitations

The system only needs to handle large-scale objects which are present in static outdoor scenes. The tracking method used requires that the object of interest has a lot of texture, which the stone and gravel piles have. Instead of streaming data to a server or workstation, the data is passed manually, and all computations are done offline and currently not communicated back to the smartphone due to the time frame of the thesis work. Only one smartphone will be used to gather data.


Figure 1.1: A simple flowchart of the problem

2 Theory

The goal of estimating the volume of stone and gravel piles using smartphone data, such as the video stream, can be achieved with different computer vision methods. In this chapter some of the most common 3D reconstruction and volume estimation methods are presented. In these methods, a 3D point cloud representation is generated by using the video stream data.

With a 3D point cloud the volume can be calculated, but only in the units of the local camera coordinate system. To find the relation to the real world coordinate system (WCS), which has units in metres, the data needs to be scaled correctly. The idea explored in this thesis work is to use some of the smartphone's inertial measurement unit (IMU) sensors as additional input. In particular the accelerometer and gyroscope data are used, in the following referred to as the IMU data. The IMU data is collected in the sensor body frame, yet another CS, from now on referred to as ICS. To correctly have this data represent the real world movement, it is transformed to the real world CS (WCS), first with an initial rotation estimated from the first samples; this transformation is denoted R^{I,W}_{t=0}. This initial transformation to WCS is estimated using the gravity impact on the accelerometer y-axis, which is pointing up, i.e. along the negative direction of gravity, as shown in figure 2.1, where the WCS is defined from the first image taken. As the IMU data and video are recorded and the sensor unit moves around the pile, a time dependent rotation back to the original coordinate system is estimated. The other two reference frames, ICS and the camera coordinate system (CCS), are defined as shown in figure 2.2.

2.1 Related Work

Similar studies have shown that it is possible to estimate volumes for predetermined types of objects, in particular smaller objects.



Figure 2.1: How the WCS is defined in the thesis

Figure 2.2: The CS in the smartphone, to the left ICS and to the right CCS


2.1.1 Recognition and volume estimation of food intake using a mobile device

It has been shown that volume estimation of food plates can be made [19]. Here the user needs to identify the type of food, and a reference pattern is used in order to compute the unknown scale factor present in the 3D reconstruction. A dense point cloud together with Delaunay triangulation enables the volume to be calculated for each voxel.

2.1.2 Estimating volume and mass of citrus fruits by image processing technique

A study has been made in an attempt to facilitate the packaging process of citrus fruits. The goal was to remove the need for manual weighing of all gathered fruits before packaging. The method uses image processing to estimate the volume of each fruit. This is done by estimating the cross-sectional areas and the volume of each elliptical slice segment. These segments are generated from two cameras, mounted perpendicular to each other, capturing images simultaneously [16]. Using only two cameras with exact mounting is not within the scope of this thesis and therefore this method cannot be used.

2.1.3 Shape from focus

Another approach is to utilise "shape from focus" in order to determine the distance to the target object. This method is based on using a microscope with a camera taking still images. Such a setup has a shallow depth of field and relies on the internal camera parameters changing when the focus is shifted around on the object and to the surrounding scene [14]. This approach is however deemed not to be stable enough, since the cellphone camera is designed to work the opposite way to a microscope: it has a wide angle lens, small sensors and no moving parts.

2.2 Sensor Data and Sensor Fusion

The obtained IMU data goes through some processing in order to estimate rotations of the device; these steps are described below.

2.2.1 Gyroscope

The gyroscope measures the rotation \vec{\omega}(t) of the device along three axes, in rad/s, as a function of time. The rotations measured are the yaw, pitch and roll shown in figure 2.2, given in ICS. This response has a bias \vec{b}^{gyro}, which is a constant error for a particular recording. The bias changes slowly over the lifespan of a sensor, and its variation increases with sensor usage and particularly with time [17]. With this in consideration, the response that compensates for the bias along each axis is:

\omega^{adj}_x(t) = \omega_x(t) - b^{gyro}_x,
\omega^{adj}_y(t) = \omega_y(t) - b^{gyro}_y,
\omega^{adj}_z(t) = \omega_z(t) - b^{gyro}_z,    (2.1)

where \vec{b}^{gyro} is the bias.

To find the time dependent rotation from ICS to WCS, the response \vec{\omega}^{adj}(t) formed from the components above is used. For a particular time interval between a sample gathered at t and the previous sample at t - 1, quaternions are formed for each sample by means of integration of \vec{\omega}^{adj}(t). This is done iteratively for each sample; basically the quaternions are formed as described on p. 69 of [23]:

\vec{q}(t) = \vec{q}(t-1) + \frac{\Delta t}{2}
\begin{pmatrix}
-q_1(t-1) & -q_2(t-1) & -q_3(t-1) \\
 q_0(t-1) & -q_3(t-1) &  q_2(t-1) \\
 q_3(t-1) &  q_0(t-1) & -q_1(t-1) \\
-q_2(t-1) &  q_1(t-1) &  q_0(t-1)
\end{pmatrix}
\vec{\omega}^{adj}(t),    (2.2)

where \Delta t is the time between the samples and the initial quaternion is q(0) = [1 0 0 0]^T. The quaternions are then normalised such that ||q(t)|| = 1, to properly represent a normalised axis and rotation angle. The quaternions are then converted to rotation matrices in order to be applied to vectors in R^3. Such a rotation matrix R_I(t) rotates the ICS for time instance t back to the initial ICS, at t = 0, and is constructed as [11]:

R_I(t) =
\begin{pmatrix}
q_0^2 + q_1^2 - q_2^2 - q_3^2 & 2 q_1 q_2 - 2 q_0 q_3         & 2 q_1 q_3 + 2 q_0 q_2 \\
2 q_1 q_2 + 2 q_0 q_3         & q_0^2 - q_1^2 + q_2^2 - q_3^2 & 2 q_2 q_3 - 2 q_0 q_1 \\
2 q_1 q_3 - 2 q_0 q_2         & 2 q_2 q_3 + 2 q_0 q_1         & q_0^2 - q_1^2 - q_2^2 + q_3^2
\end{pmatrix},    (2.3)

where \vec{q}(t) = [q_0 q_1 q_2 q_3].
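A minimal sketch of this integration chain, assuming bias-compensated gyroscope samples in ICS at a fixed sample rate; the quaternion layout [q_0 q_1 q_2 q_3] and the matrices follow equations (2.2) and (2.3), while the function names and containers are illustrative only and not taken from the thesis implementation.

```cpp
#include <array>
#include <cmath>
#include <vector>

using Quat = std::array<double, 4>;   // [q0, q1, q2, q3], q0 is the scalar part
using Vec3 = std::array<double, 3>;   // bias-compensated gyro sample [wx, wy, wz] in rad/s

// One integration step of equation (2.2): q(t) = q(t-1) + dt/2 * S(q(t-1)) * w(t),
// followed by normalisation so that ||q|| = 1.
Quat integrateGyroSample(const Quat& q, const Vec3& w, double dt)
{
    const double h = 0.5 * dt;
    Quat qn;
    qn[0] = q[0] + h * (-q[1] * w[0] - q[2] * w[1] - q[3] * w[2]);
    qn[1] = q[1] + h * ( q[0] * w[0] - q[3] * w[1] + q[2] * w[2]);
    qn[2] = q[2] + h * ( q[3] * w[0] + q[0] * w[1] - q[1] * w[2]);
    qn[3] = q[3] + h * (-q[2] * w[0] + q[1] * w[1] + q[0] * w[2]);

    const double n = std::sqrt(qn[0] * qn[0] + qn[1] * qn[1] + qn[2] * qn[2] + qn[3] * qn[3]);
    for (double& c : qn) c /= n;
    return qn;
}

// Convert a unit quaternion to the rotation matrix of equation (2.3), stored row-major.
std::array<double, 9> quatToRotation(const Quat& q)
{
    const double q0 = q[0], q1 = q[1], q2 = q[2], q3 = q[3];
    return {
        q0*q0 + q1*q1 - q2*q2 - q3*q3,  2*(q1*q2 - q0*q3),              2*(q1*q3 + q0*q2),
        2*(q1*q2 + q0*q3),              q0*q0 - q1*q1 + q2*q2 - q3*q3,  2*(q2*q3 - q0*q1),
        2*(q1*q3 - q0*q2),              2*(q2*q3 + q0*q1),              q0*q0 - q1*q1 - q2*q2 + q3*q3
    };
}

// Integrate a whole recording, starting from q(0) = [1 0 0 0]^T.
std::vector<Quat> integrateGyro(const std::vector<Vec3>& gyro, double dt)
{
    std::vector<Quat> qs;
    qs.push_back({1.0, 0.0, 0.0, 0.0});
    for (const Vec3& w : gyro)
        qs.push_back(integrateGyroSample(qs.back(), w, dt));
    return qs;
}
```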

2.2.2 Accelerometer

Similarly, the accelerometer data is affected by a bias which is introduced at the start of each recording:

\vec{a}^{adj}(t) = \vec{a}(t) - \vec{b}^{acc},    (2.4)

where \vec{a}(t) is the actual reading, given in m/s^2, from the sensor and \vec{b}^{acc} is the bias of unknown magnitude and direction for the accelerometer.

To express these accelerometer readings in WCS, the dependency between R^{I,W}_{t=0}, R_I(t) and the gravitation is as follows:

\vec{a}_{world}(t) = R^{I,W}_{t=0} R_I(t) (\vec{a}(t) - \vec{b}^{acc}) - \vec{g},    (2.5)


where \vec{b}^{acc} is the three element bias vector and \vec{g} is the gravitation. These transformed readings are then integrated twice and used to find the displacement and ultimately the unknown CCS scale, which is explained in more detail in section 3.5. The accelerometer readings are also used to find the initial IMU-to-world rotation R^{I,W}_{t=0}. This is done by using the first samples of accelerometer readings, which are logged when the user is standing still with the camera towards the object of interest. Here the gravity \vec{g} will induce readings on all axes of the accelerometer, and the mean of these readings \vec{a}_{mean} is used to find a rotation that satisfies:

\vec{g} = R^{I,W}_{t=0} \vec{a}_{mean},    (2.6)

where \vec{g} is defined in WCS as [0, 9.82, 0]. There is more than one possible rotation that satisfies equation (2.6). This rotation ambiguity is not considered to be a problem, since the volume estimation is not dependent on the choice of this rotation, so any rotation that satisfies equation (2.6) is used. This holds only when applying R^{I,W}_{t=0} to \vec{a}_{mean}; applying the different valid rotations to some other vector will in general give different results.
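As a sketch of how R^{I,W}_{t=0} can be obtained in practice: any rotation that maps the mean stationary accelerometer reading onto the gravity direction satisfies equation (2.6), so a shortest-arc rotation between the two vectors is sufficient. The use of Eigen below is an illustrative choice and not taken from the thesis implementation; at least one stationary sample is assumed.

```cpp
#include <Eigen/Dense>
#include <vector>

// Estimate the initial IMU-to-world rotation from accelerometer samples logged while
// the user stands still. Any rotation satisfying g = R * a_mean works (equation 2.6);
// Quaterniond::FromTwoVectors gives the shortest-arc rotation between the two directions.
Eigen::Matrix3d initialImuToWorldRotation(const std::vector<Eigen::Vector3d>& stationaryAcc)
{
    Eigen::Vector3d aMean = Eigen::Vector3d::Zero();
    for (const auto& a : stationaryAcc) aMean += a;
    aMean /= static_cast<double>(stationaryAcc.size());

    const Eigen::Vector3d g(0.0, 9.82, 0.0);   // gravity in WCS, as defined in the thesis

    // Rotation taking the direction of aMean onto the direction of g.
    const Eigen::Quaterniond q = Eigen::Quaterniond::FromTwoVectors(aMean, g);
    return q.toRotationMatrix();
}
```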

2.2.3 Bias Estimation

There exist different methods to compensate for the sensor biases. The most common approach is to estimate the bias of the sensors over some seconds of time. This can only be done when the sensor is completely stationary.

Another method, which estimates gyroscope and accelerometer bias based on GPS, has proven to be successful when used in a moving car. In the scenes used in this thesis the distance to the object of interest is on the scale of 10 m. Since these scenes are also outdoors this method might be applicable, given that the GPS module is of higher quality than the one present in the smartphone or that an external sensor is used [15].

2.3 Feature Point Tracker

The goal of the tracker is to produce a set of image feature points y_k for a subset of images k = 1, ..., l. The set of points is then used as input to a 3D reconstruction algorithm. The target family of objects, stone and gravel piles, has a high repeatability. This means that, for any given object at any time, the texture and consequently the video images will be similar. With this in mind, a local tracker operating on small pixel areas with a high frame rate of 60 instead of the normal 30 frames per second (fps) is sufficient. Such a high fps means that there is a smaller displacement in each image. The tracker uses a track-re-track process for each pair of images by applying tracking in both directions. If a point is not correctly tracked both from the first image to the second and vice versa, it is removed in order to ensure robust correct tracking, explained in more detail in section 3.2. The tracking method used is based on the corner detection method goodFeaturesToTrack [21]. The tracker also stores a unique ID for each point track, which is used in a later step.
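A sketch of one such track-re-track step using the OpenCV calls named in chapter 3 (goodFeaturesToTrack and calcOpticalFlowPyrLK); the forward-backward threshold and detection parameters are illustrative values, not the thesis defaults.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Forward-backward ("track-re-track") KLT step between two consecutive grey-scale frames.
// Points whose backward track does not return to the original location within fbThresh
// pixels are rejected.
void trackRetrack(const cv::Mat& prevGray, const cv::Mat& currGray,
                  std::vector<cv::Point2f>& pts,       // in: points in prev, out: surviving points in prev
                  std::vector<cv::Point2f>& ptsNext,   // out: their positions in curr
                  float fbThresh = 1.0f)
{
    if (pts.empty()) {
        // Detect up to 1100 corners in the previous frame, as stated in the thesis.
        cv::goodFeaturesToTrack(prevGray, pts, 1100, 0.01, 8.0);
    }

    std::vector<cv::Point2f> fwd, bwd;
    std::vector<unsigned char> stF, stB;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, currGray, pts, fwd, stF, err);   // forward track
    cv::calcOpticalFlowPyrLK(currGray, prevGray, fwd, bwd, stB, err);   // backward track

    std::vector<cv::Point2f> keptPrev, keptCurr;
    for (size_t i = 0; i < pts.size(); ++i) {
        if (!stF[i] || !stB[i]) continue;
        float dx = bwd[i].x - pts[i].x;
        float dy = bwd[i].y - pts[i].y;
        if (dx * dx + dy * dy > fbThresh * fbThresh) continue;   // re-track check failed
        keptPrev.push_back(pts[i]);
        keptCurr.push_back(fwd[i]);
    }
    pts = keptPrev;
    ptsNext = keptCurr;
}
```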

2.4 Pinhole Camera Model

The most common camera model in computer vision is the pinhole camera, which is also used in this thesis. It models the camera aperture as a point, with no lenses that focus the light. This simplified model has a CCS that does not use a discrete mapping from 3D to 2D points as the smartphone camera does, i.e. where a world point is mapped to a discrete numbered pixel. The pinhole camera model does not model any lens distortions either.

As a whole, the pinhole camera model is described as:

C \sim K (R \,|\, \vec{t}),    (2.7)

where C is the camera matrix, K is an upper triangular 3x3 matrix describing the intrinsic parameters, R a 3x3 rotation matrix and \vec{t} a 3x1 translation vector. The \sim means that the equality holds only up to an unknown scale factor.

The projection of 3D points to 2D image points in homogeneous coordinates is then:

y_k \sim C_{norm} \tilde{x}_k = (R \,|\, \vec{t}) \begin{pmatrix} \vec{x}_k \\ 1 \end{pmatrix} = R \vec{x}_k + \vec{t},  k = 1, ..., n,    (2.8)

where y_k = \begin{pmatrix} \vec{y}_k \\ 1 \end{pmatrix} and C_{norm} is the C-normalised camera, defined as:

C_{norm} \sim (R \,|\, \vec{t}),    (2.9)

where C_{norm} is the camera pose with the dependency on the internal parameters K removed. (R | \vec{t}) represents a rigid transformation in C-normalised CCS from the previous camera pose to the current one [11]. Essentially this describes the camera's position and orientation in the local C-normalised CCS where the 3D reconstruction will be made. This means that the reconstruction is defined in Euclidean space, where the camera poses can be related to each other.

Finally, the projection operator P(C_k, \tilde{x}_j) is defined below in equation (2.10):

P(C, \tilde{x}_j) = \frac{1}{w_j} \begin{pmatrix} u_j \\ v_j \end{pmatrix},  where  \begin{pmatrix} u_j \\ v_j \\ w_j \end{pmatrix} = C \tilde{x}_j.    (2.10)

2.4.1 Intrinsic Calibration

Each smartphone camera has unique properties which must be modelled and estimated in order to produce robust results. The intrinsic parameter matrix K can be estimated offline by the method explained in [11]. An implementation of this estimation can be found in the Matlab toolbox Camera Calibration Toolbox [1].


It uses a set of images of a chessboard of known size, captured with a fixed focal length of the cellphone, i.e. fixed focus on the camera, in order to form constraints on K. At the beginning of each scene recording the camera focus is set and fixed to simulate the same conditions as when the camera was calibrated. This is done using Google's camera API in the smartphone, by disabling the autofocus and focusing roughly on the chessboard.

2.4.2 Epipolar Geometry

The geometry regarding relations between two camera views depicting the same static scene can be described using epipolar geometry. The goal is to relate how a 3D point is projected into two camera views. Consider the simplest version of the problem, where the image points y_1 in view 1, with camera matrix C_1, and y_2, depicted in C_2, are known. These points are corresponding points if they both are projections of the same 3D point x.

The projection line L_{y_1} then contains all possible projections of x into C_1. The epipolar point e_{12} is defined as the projection of the second camera centre into C_1. The epipolar line l is then the line between this epipole and the projection of L_1, described in more detail in chapter 9 of [11]. Two image points, y_1 and y_2, need to fulfil what is called the epipolar constraint to be corresponding points:

0 = y_1 \cdot l_1 = y_1^T l_1 = y_1^T F y_2,    (2.11)

where F is called the fundamental matrix, which describes the epipolar constraint. In the uncalibrated case, i.e. when the two camera matrices are not known, F is estimated from y_0 and y_1, a set of corresponding image points that have been tracked from the first to the second image [22]. A simple error cost function that minimises the geometric error over the points and epipolar lines can be defined as [11]:

\varepsilon = \sum_{j=1}^{n} d_{pl}(y_{0j}, l_{0j})^2 + d_{pl}(y_{1j}, l_{1j})^2,    (2.12)

where n denotes the number of corresponding points in the views, d_{pl} the point-to-epipolar-line distance, y_{0j} and y_{1j} are points in the first and second view, and l_{0j} and l_{1j} are the epipolar lines defined below:

l_0 \sim F y_1,  l_1 \sim F^T y_0.    (2.13)

The minimisation is done by a non-linear optimisation methodology and consequently minimises the point-to-epipolar-line distance [9]. The algorithm, its implementation and solution details can be found in OpenMVG's documentation [18].

One problem remains when F is found: there exists an infinite set of cameras which are consistent with F. To determine the relative camera positions, the problem must be solved for calibrated cameras, and this is where the essential matrix E is introduced, further explained in section 2.6.1.
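For illustration, the sketch below estimates F from tracked correspondences and evaluates the symmetric point-to-epipolar-line error of equation (2.12) using OpenCV; the thesis pipeline instead relies on OpenMVG's a contrario estimator, so this is an assumed stand-in rather than the actual implementation, and the RANSAC parameters are illustrative.

```cpp
#include <cmath>
#include <opencv2/opencv.hpp>
#include <vector>

// Estimate F from tracked point correspondences (view 0 -> view 1) and return the mean
// symmetric point-to-epipolar-line error of equation (2.12) over the inliers.
double fundamentalAndEpipolarError(const std::vector<cv::Point2f>& pts0,
                                   const std::vector<cv::Point2f>& pts1,
                                   cv::Mat& F)
{
    std::vector<unsigned char> inliers;
    F = cv::findFundamentalMat(pts0, pts1, cv::FM_RANSAC, 1.0, 0.99, inliers);

    std::vector<cv::Vec3f> linesIn1, linesIn0;
    cv::computeCorrespondEpilines(pts0, 1, F, linesIn1);   // lines in view 1 for points of view 0
    cv::computeCorrespondEpilines(pts1, 2, F, linesIn0);   // lines in view 0 for points of view 1

    // Point-to-line distance d_pl for a line (a, b, c) and a point (x, y).
    auto dpl = [](const cv::Point2f& p, const cv::Vec3f& l) {
        return std::abs(l[0] * p.x + l[1] * p.y + l[2]) / std::sqrt(l[0] * l[0] + l[1] * l[1]);
    };

    double err = 0.0;
    int n = 0;
    for (size_t j = 0; j < pts0.size(); ++j) {
        if (!inliers[j]) continue;
        double d0 = dpl(pts0[j], linesIn0[j]);
        double d1 = dpl(pts1[j], linesIn1[j]);
        err += d0 * d0 + d1 * d1;
        ++n;
    }
    return n > 0 ? err / n : 0.0;
}
```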


2.5 Minimisation Methods

A common tool used in computer vision for robust estimation in the presence of outliers is the so-called Random Sample Consensus (RANSAC) method [5].

2.5.1 RANSAC

RANSAC is effectively used in estimation problems where outliers occur, which is the case for example when corresponding points need to be found. A RANSAC scheme is run for a finite amount of time in order to find a subset which, with the highest possible probability, only contains inliers, and then a model is estimated from this subset. This subset is denoted the consensus set C, with a corresponding estimated model M. The RANSAC scheme is an iterative process and it is run for a predetermined number of iterations or until C is sufficiently large. A general RANSAC scheme is roughly described below.

Starting from the full data set, a smaller subset T is randomly chosen in each iteration. From T, M is estimated, and even if this subset would contain only inliers, M would only fit most, but not all, of the remaining inliers in the data set, since there is also an inaccuracy in the inliers.

For each RANSAC iteration, C_est is formed, consisting of all points in the data set that fit the estimated model M. To determine if a point fits the model, a cost error function \varepsilon is defined. For example, if the estimated model is a line, \varepsilon is the geometric distance between a point y in the full data set and the line currently estimated from T. If \varepsilon for this y is below some threshold t, the point is considered an inlier and consequently added to C_est for this iteration of T [9].

Initially C is empty, but it is replaced whenever the current consensus set C_est is larger than C. The estimated M is then used until a larger C_est is found or until the scheme is finished. As an optional but common step, a last RANSAC iteration can then be run on C_est to improve M even further, excluding any potential outliers.
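A minimal sketch of this general scheme for the line-fitting example mentioned above; the threshold, the iteration count and the use of std::rand are illustrative choices only.

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

struct Point { double x, y; };
struct Line  { double a, b, c; };   // line a*x + b*y + c = 0 with a^2 + b^2 = 1

// Line through two points, normalised so that the point-to-line distance is |a*x + b*y + c|.
static Line lineFromTwoPoints(const Point& p, const Point& q)
{
    double a = q.y - p.y, b = p.x - q.x;
    double n = std::sqrt(a * a + b * b);
    a /= n; b /= n;
    return {a, b, -(a * p.x + b * p.y)};
}

// Plain RANSAC: repeatedly fit a model M to a minimal random subset T and keep the
// model whose consensus set C (points with error below t) is largest.
Line ransacLine(const std::vector<Point>& data, double t, int iterations)
{
    Line best{0, 0, 0};
    size_t bestConsensus = 0;

    for (int it = 0; it < iterations; ++it) {
        size_t i = std::rand() % data.size();
        size_t j = std::rand() % data.size();
        if (i == j) continue;                       // need two distinct points for a line

        Line M = lineFromTwoPoints(data[i], data[j]);

        size_t consensus = 0;                       // size of C_est for this hypothesis
        for (const Point& p : data)
            if (std::fabs(M.a * p.x + M.b * p.y + M.c) < t)
                ++consensus;

        if (consensus > bestConsensus) {            // replace C whenever C_est is larger
            bestConsensus = consensus;
            best = M;
        }
    }
    // A final least-squares fit on the largest consensus set could refine the model further.
    return best;
}
```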

2.5.2 Improving RANSAC

The OpenMVG system uses a unique approach in its minimisation method, called a contrario RANSAC (AC-RANSAC). The main idea of this approach is to find a model which best fits the data, adapted automatically to any noise. AC-RANSAC looks for a C that includes a controlled Number of False Alarms (NFA), i.e. models that are generated by chance, further explained in their article [13]. NFA is a function of M and the hypothesised number of inliers in the data set. Using NFA, a model M is considered valid if:

NFA(M) = \min_{k = N_{sample}+1, ..., n} NFA(M, k) \le \varepsilon.    (2.14)

The only parameter for AC-RANSAC is consequently \varepsilon, which is usually set to 1. The AC model estimation finds \arg\min_M NFA(M) among all models M, from all possible N_{sample} correspondences present in the current T. Thus, instead of maximising the inlier count as in RANSAC, the NFA is minimised, and this is the task of AC-RANSAC. A model is eventually returned if it provides at least 2 N_{sample} inliers.

AC-RANSAC replaces ordinary RANSAC because of its sheer improvements. It removes the need to set a globally fixed threshold for each parameter, which must be done when running the RANSAC algorithm. Instead the thresholds are adapted to the input data, automatically by the nature of AC-RANSAC. For example, it is used to estimate E with an adaptive threshold t for the inliers. How this specific AC-RANSAC scheme is used is explained below in section 2.6 and in the method chapter 3.

2.6 Structure-from-Motion

In 3D reconstruction by Structure-from-Motion (SfM), a representation of a scene is determined up to an unknown scaling factor. The final results from SfM are camera poses, denoted C, and 3D points, denoted \vec{x}, which represent the structure found in the scene in the video. The coordinates of C and \vec{x} are given in the C-normalised local CCS, i.e. estimated in a normalised camera coordinate system with the dependency on K removed. In the local CCS the coordinates are defined in some unknown unit, and to estimate the real volume \vec{x} must be related to the WCS so that the coordinates represent real units, e.g. metres.

The camera pose matrices relate to each other through the three matrices explained in section 2.4, equation (2.7). The 3D points \vec{x} are generated from y_k using triangulation. The other algorithms used in SfM and the basic steps of a general SfM pipeline are briefly described below.

2.6.1 Two-view reconstruction

The first step of 3D reconstruction is to find the camera poses which relate the initial pair of two images; with that information the first 3D points can be triangulated.

A first solution can be found by solving an epipolar geometry problem relating two images of the same scene. Using F, the relative position between two cameras can be found in calibrated camera coordinates. The essential matrix E is introduced and is computed as [11]:

E = K^T F K.    (2.15)

It is basically the fundamental matrix defined in C-normalised coordinates. With that, the infinite solution space is narrowed down to a set of 4 camera poses. By using a corresponding point pair, a 3D point \vec{x} is triangulated for a CCS centred in one of the cameras, C_1. If the last coordinate of \vec{x} is above 0, the 3D point is in front of C_1. To determine if this is the case for both cameras, \vec{x} must be transformed to the C_2 CCS. This is done by the relation \vec{x}' = R\vec{x} + \vec{t}. In one of the four cases the last coordinate of \vec{x}' will also be above 0, which means that the 3D point is in front of both cameras, and consequently this solution is used. A common method for two-view triangulation is the mid-point method, described in section 14.1.1 in [11]. The exact methodologies of a contrario E estimation, triangulation and pose estimation are described in the OpenMVG documentation [12]. With the camera-centred CS in Euclidean space, the translation vector has a determined direction but not length. Here \vec{t} is set to have unit length and consequently scales the reconstructed scene. With this assumption, the relation between WCS and CCS can be found by approximating the distance between each camera pose with IMU data. This unknown scaling factor from CCS to WCS is from here on denoted s_{C2W}, and how it is generated is described in section 2.8.3.
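A sketch of this step with OpenCV (version 3 or later) as an assumed stand-in for the OpenMVG routines: E is formed from F and K as in equation (2.15), and cv::recoverPose performs the cheirality test that selects the one of the four candidate poses placing the points in front of both cameras, with \vec{t} returned at unit length. F and K are assumed to be 3x3 matrices of type CV_64F.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Two-view relative pose from a known fundamental matrix and intrinsics.
bool relativePoseFromF(const cv::Mat& F, const cv::Mat& K,
                       const std::vector<cv::Point2f>& pts0,
                       const std::vector<cv::Point2f>& pts1,
                       cv::Mat& R, cv::Mat& t)
{
    cv::Mat E = K.t() * F * K;                         // essential matrix, equation (2.15)
    int nInFront = cv::recoverPose(E, pts0, pts1, K, R, t);
    return nInFront > 0;                               // some points pass the cheirality test
}
```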

2.6.2 Perspective-n-point - adding a view

After the first two poses are found using the methodology explained in section 2.6.1, each newly added view needs to be handled. For example, for the next view some n known C-normalised image points y_j, j = 1, ..., n, can be tracked to the previous image.

Since the corresponding 3D points are known for these new image points, the camera pose can be estimated for this new view. For example, for each newly added view, C can be found by solving the PnP problem.

Since the data may contain noise, the geometric minimisation of the PnP estimation problem is formulated for one camera pose as below:

\varepsilon_{PnP,Geo} = \sum_{j=1}^{n} d_{pp}(y_j, y'_j)^2,  where  y'_j = P(C, \tilde{x}_j),    (2.16)

where P(C, \tilde{x}_j) is explained above in equation (2.10). This is then minimised over R \in SO(3) and \vec{t} \in R^3. RANSAC, i.e. the algorithm that handles data with outliers described in section 2.5, is used to find a robust P3P solution. The PnP solution is then minimised over the inlier set, i.e. only the points which are reprojected within some threshold [11].

2.6.3 Bundle Adjustment

Bundle adjustment (BA) is a process used when a new pose has been added. At these occasions, new 3D points are added using a RANSAC triangulation scheme for the given poses. The overall solution of these 3D points and camera poses is not ideal, i.e. the re-projection error for all 3D points in all camera poses is not zero. By using the BA method all parameters are refined and bad points are flushed out. The goal is to minimise the reprojection error for all \vec{x}_j. This is done by letting the BA algorithm vary the pose parameters (R_k, t_k) and the positions of the 3D points \vec{x}_j in order to minimise:

\varepsilon_{BA,l} = \sum_{k=1}^{l} \sum_{j=1}^{n} W_{kj} \, d_{pp}(y_{kj}, P(C_k, \tilde{x}_j))^2,    (2.17)

where l is the current number of camera poses, n is the current number of 3D points, C_k is the camera matrix for pose k, and d_{pp} is the point-to-point distance between y_{kj} and the re-projected point C_k \tilde{x}_j. Essentially the overall projection error is minimised, i.e. the distance between the observed 2D coordinate and the re-projection of the known 3D point, for all views in which it is seen. W_{kj} is a visibility function which tells whether a 3D point \vec{x}_j is visible in pose k. This optimisation is done using a non-linear optimisation methodology, e.g. non-linear least squares.

2.7 Volumetric Representation - Space Carving

Some different volume estimation methods have been presented in section 2.1. The one used in the thesis is called Space Carving. By using the set of camera matrices produced for all views and refined by BA, the volume can be calculated in the CCS up to an unknown scale.

The Space Carving method produces a set of voxels, the volume of which is easily calculated since it consists of voxels of predetermined size. This voxel block is a representation of the reconstructed scene, and its volume is determined by the methods described in [6].

The first step is to find suitable limits in each dimension for the voxel block in order to initiate it, and also to determine its resolution. The boundary dimensions are initially limited by a relation to the actual positions of the camera centers. In the current implementation the initial boundary condition is set to 75% of each dimension found by the minimum and maximum positions of the camera centers.

The boundaries are then also limited by the camera frustums. The camera frustum is the space which is depicted by a camera, and all of the camera frustums are used to limit and offset the voxel block height-wise. The direction of the camera with index k is found as follows:

\vec{n}_k = \vec{Y} / ||\vec{Y}||,    (2.18)

where \vec{Y} = R_{k,r3} and R_{k,r3} is the third row of the rotation matrix of camera k, ||\vec{Y}|| is the norm of \vec{Y}, and \vec{n}_k is consequently the viewing direction for camera k.

With these boundaries known, a voxel block is finally initiated, where the number of voxels is set to a predetermined number.

2.7.1 Object Silhouette and Projection

The next step is to project all voxels in the block into each image to determine whether they belong to the object or not. The images used here are binary masks, representing the object and its silhouette. A challenge here is to find a suitable segmentation method for the object of interest. In this thesis work a simple color segmentation is made, the implementation of which is described in chapter 3.


This binary mask is used to determine whether a voxel belongs to the object or not. This is done by projecting each voxel in the scene into the image. A single projection outside the binary mask means that the voxel is removed from the voxel grid.

The remaining voxels are then counted and used as the estimate of the volume, but still in CCS.
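A compact sketch of this carving loop, assuming the camera matrices C_k = K(R_k | t_k) and per-view silhouette masks are already available; the data structures and the choice of voxel centres are illustrative.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

struct View {
    cv::Mat P;            // 3x4 camera matrix C = K(R|t) for this view, CV_64F
    cv::Mat silhouette;   // binary mask (CV_8U), non-zero where the object is seen
};

// Carve a voxel block: a voxel survives only if it projects inside the silhouette of
// every view. Voxel centres are given in the (unscaled) CCS of the reconstruction.
// The volume estimate is then the number of survivors times the volume per voxel.
std::vector<cv::Point3d> carve(const std::vector<cv::Point3d>& voxels,
                               const std::vector<View>& views)
{
    std::vector<cv::Point3d> kept;
    for (const cv::Point3d& v : voxels) {
        bool insideAll = true;
        cv::Mat X = (cv::Mat_<double>(4, 1) << v.x, v.y, v.z, 1.0);   // homogeneous 3D point

        for (const View& view : views) {
            cv::Mat x = view.P * X;                     // project: (u, v, w)^T = C * X
            double w = x.at<double>(2);
            int u  = static_cast<int>(x.at<double>(0) / w);
            int vv = static_cast<int>(x.at<double>(1) / w);

            // A single projection behind the camera, outside the image or outside the
            // silhouette removes the voxel.
            if (w <= 0 ||
                u < 0 || u >= view.silhouette.cols ||
                vv < 0 || vv >= view.silhouette.rows ||
                view.silhouette.at<unsigned char>(vv, u) == 0) {
                insideAll = false;
                break;
            }
        }
        if (insideAll) kept.push_back(v);
    }
    return kept;
}
```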

2.8 Fusion of Camera and IMU Sensors & SfM Improvements

To determine s_{C2W}, the transformation from the local C-normalised CCS to the real world CS described in section 2.6.1, the sensor data from the accelerometer, section 2.2.2, and the gyroscope, section 2.2.1, are used. To improve the precision, the physical distance between the sensors can for example be estimated.

2.8.1 Estimating IMU to Camera Transformation

Sensor fusion can be used to improve translation and rotation estimates by finding the rigid transformation between ICS and CCS. This is a tedious and time-consuming process and normally requires additional equipment, but it would in the end improve the performance of the scale estimate.

In an attempt to improve robot navigation systems, it has been shown that sensor bias can be accurately obtained, as well as the metric scale and the transformation between IMU and camera. This is all done autonomously by using the visual information combined with sensor data from the gyroscope and accelerometer alone [10]. This has not been done in the thesis, and the local CCS is instead approximated to be the same as the ICS. This is not an issue since the sensors are located in a rigid frame and within a couple of centimetres of each other.

2.8.2 Global Positioning System Fused With SfM

With a stable, high-performing GPS, the SfM cost function can be modified so that it punishes camera centers drifting too far away from the corresponding geotags. This would be similar to an algorithm that does robust scene reconstruction where noisy GPS data is used for camera center initialisation [3]. The GPS module of the smartphone used does not have such precision and this method was therefore not used.

2.8.3 Scaling Factor Between CCS and WCS

The unknown camera-to-world scale factor s_{C2W} is most commonly deduced by introducing a reference object of known length in the scene. Here, this approach is only used to validate the performance of the system. Instead, s_{C2W} is estimated by finding the relation between the real world translations t^W_{T,T+1} and the corresponding camera translations t^C_{T,T+1} from time instances T to T + 1 for T < 100.


To find a given translation t^W_{T,T+1} in WCS between the keyframes at time instances T and T + 1, the positions p(T + 1) and p(T) for these two time samples must be found. In order to correctly integrate the acceleration to position, the initial velocity is assumed to be zero. With this assumption in mind, p(T) is found by a double integration of the acceleration as:

p(T) = \int_0^T \! \int_0^T \vec{a}_{world}(t) \, dt \, dt,    (2.19)

where \vec{a}_{world}(t) is defined in equation (2.5). The translation vectors are then found as follows:

t^W_{T,T+1} = p(T + 1) - p(T).    (2.20)

The corresponding vector t^C_{T,T+1} in CCS is simply found as the vector between the two camera centres representing T and T + 1. In this implementation the scale s_{C2W} is simply calculated as the mean of the ratio of their norms:

s_{C2W} = \frac{1}{100} \sum_{k=1}^{100} \frac{||t^W_{k-1,k}||}{||t^C_{k-1,k}||}.    (2.21)
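A plain sketch of equations (2.19)-(2.21): the world-frame acceleration is double-integrated with zero initial velocity and position, sampled at the keyframe instants, and the scale is the mean ratio of WCS to CCS translation norms over the first hundred pose pairs. The mapping between IMU samples and keyframes is assumed to be given; all names are illustrative.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

// keyframeSampleIdx[k] is the IMU sample index of keyframe k, and camCenters[k] the
// corresponding SfM camera centre in CCS.
double scaleFromImu(const std::vector<Vec3>& aWorld, double dt,
                    const std::vector<int>& keyframeSampleIdx,
                    const std::vector<Vec3>& camCenters)
{
    // p(T) by double integration of a_world(t), equation (2.19).
    Vec3 v{0, 0, 0}, p{0, 0, 0};
    std::vector<Vec3> pos;
    for (const Vec3& a : aWorld) {
        v = {v.x + a.x * dt, v.y + a.y * dt, v.z + a.z * dt};
        p = {p.x + v.x * dt, p.y + v.y * dt, p.z + v.z * dt};
        pos.push_back(p);
    }

    auto dist = [](const Vec3& a, const Vec3& b) {
        return std::sqrt((a.x - b.x) * (a.x - b.x) +
                         (a.y - b.y) * (a.y - b.y) +
                         (a.z - b.z) * (a.z - b.z));
    };

    // Mean ratio of WCS to CCS translation norms over the first 100 pose pairs, equation (2.21).
    double sum = 0.0;
    const int N = 100;
    for (int k = 1; k <= N; ++k) {
        const Vec3& pW0 = pos[keyframeSampleIdx[k - 1]];
        const Vec3& pW1 = pos[keyframeSampleIdx[k]];
        sum += dist(pW0, pW1) / dist(camCenters[k - 1], camCenters[k]);
    }
    return sum / N;
}
```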

With known s_{C2W}, the volume in the 3D scene can be converted into the real volume estimate. The result of the Space Carving explained above in section 2.7 is a voxel cloud representing the object in CCS. By counting the number of voxels and using the known resolution of said voxels, the real world volume can finally be calculated as:

V = N_{tot} V_{voxel} s_{C2W}^3,    (2.22)

where N_{tot} is the total number of voxels occupied after the Space Carving. The volume per voxel given in CCS is denoted V_{voxel}, and s_{C2W} is calculated by the relation in equation (2.21).

Scaling factor determined by reference object

In the scenes a 1 m ruler, marked with distinct red tape at each end, is placed. This ruler's end points are reconstructed in the SfM process. By identifying these points, they can be used to determine a reference scale and evaluate the performance of the scale calculation. This reference scale is used as the ground truth and is from now on denoted s^{GT}_{C2W}. It is simply found by using the corresponding 3D points and the known length of the object:

s^{GT}_{C2W} = \frac{l}{||X^W_{first} - X^W_{end}||}.    (2.23)

Here, l is the real world length of the ruler specified in metres, X^W_{first} is the first 3D point of the ruler, and X^W_{end} the end 3D point of the ruler.


2.8.4 SfM Using a Hierarchical Cluster Tree

One problem of incremental SfM, explained in section 2.6, is scalability. Instead of incrementally adding one pose at a time, global SfM uses a hierarchical cluster tree with the leaves starting at each image in the video sequence. From here, parallel processes can be started with the purpose of local reconstruction among neighbouring poses. This also improves the error distribution over the entire data set, and the method is less sensitive to the initialisation and to drift. This method is in particular useful when the system should run in real time. The idea is to use a subset of the input images and feature points for each reconstructed node. These subsets subsume the entire solution and thereby reduce the computational complexity of the problem significantly compared to the traditional incremental pipeline [4]. There was not enough time to use this method in the thesis work, and it was also not considered necessary given the delimitations found in section 1.4.

3 Method

To solve the problem introduced in chapter 1, namely automatic volume estimation, a system has been developed. Using the theory explained previously, this chapter describes the implementation and how the various subproblems derived from the main goal of volume estimation are solved.

The system consists of different subsystems, which for example generate images and then communicate through text files. The software components are located on two separate hardware platforms. The first one is the cellphone, which gathers a video stream and sensor data. The recordings are started when the user is standing still, with the camera facing the object of interest. While recording the object, the user simultaneously walks around the object in a circle. The second hardware platform used is a PC running separate programs: one tracker, generating images and feature points for a Linux Virtual Machine running the 3D reconstruction, and finally a Matlab program running Space Carving and volume estimation. A detailed overview of the system's modules can be seen in figure 3.1.

Using the obtained data from the scene, the corresponding feature point tracks are generated, as described in section 3.2. These are then used in the Open Source Multiple View Geometry Library (OpenMVG) incremental SfM pipeline [18], which is described in section 3.3. In this step 3D points are generated by means of triangulation for the chosen views. By solving the PnP problem, each view is added incrementally. The corresponding camera poses are then refined by performing BA. Here both 3D points and camera parameters for all given views are tweaked to minimise a global error. Finally the output of OpenMVG, i.e. the camera poses and 3D points, is used in Space Carving in order to calculate the volume of the object depicted in the scene, described in section 3.4. The unknown scale in that scene is then found by means of sensor fusion, described in section 3.5.



Figure 3.1: Flowchart of the system workflow


3.1 Sensors and Data Acquisition

To record the video and IMU data simultaneously, two applications are used. First the IMU data log is started, before the video recording, which is recorded at 1080p@60fps. The user then fixes the focus on the object in order to mitigate any changes of the focal length induced by the autofocus of Google Camera. The result of this is a more robust internal camera matrix estimation.

The IMU data logging is started from an initial position and is continuously logged while the user walks around the object of interest. It is then stopped when the user has reached the start position (approximately). The IMU data sets are collected with a cellphone (Samsung S6) at a sample rate of 100 Hz.

The applications used are the standard Google Camera [8] and the Android version of Sensor Fusion [7], developed by Fredrik Gustafsson and Gustaf Hendeby at Linköping University.

Consequently there is an offset in capture time between the video frames and the sensor data, which also has to be synchronised due to the different sample rates. This is easily corrected by using the start time of the video recording t_camera, the start time of the IMU readings t_IMU and the sample rates f_IMU and f_camera. Simply put, there is an offset from a given frame timestamp to the corresponding IMU data. This data also needs to be transformed from ICS to WCS, using the methodology explained in section 2.2.

To be noted here is that inaccuracies of f_IMU and f_camera are not accounted for, nor are any lost frames or rolling shutter artifacts. Such artifacts are induced as a consequence of different pixel readout times when the camera is in motion. For example, a long straight object registered over the entire camera chip would appear more distorted and skewed the faster the camera is moving. Implementing e.g. rolling shutter correction might improve the tracking, but it was not within the scope of this thesis. The timing of the timestamps is also not considered, i.e. whether the time stamp refers to the start, middle or end of an exposure and how this affects induced artifacts. Bias estimation of the IMU data is also not implemented, but ideally it should be estimated for each recording.

3.2 Tracking in Video Sequence

The tracker operates on downscaled 1080p images from a 60 fps video sequence and produces corresponding feature points. The implemented tracker uses the corner detector GoodFeaturesToTrack from OpenCV 2.4.12 [2] to generate 1100 corner image points. A higher number means more robust tracking but also results in larger memory usage and slower total run time. The found features are tracked frame-to-frame using calcOpticalFlowPyrLK, enabling tracking on multiple scale levels. The matched points are then re-tracked to ensure correct matching. The remaining points are also evaluated by a geometric constraint, by finding F with a threshold of 1 pixel.

The designated goal is to produce image coordinates y_k for a subset of images denoted keyframes, which represents the entire current video sequence of N > l frames. Running on all frames would not be possible with the PC setup and would not produce a significantly better result, mainly because the tracked features are found with a high confidence between each keyframe, and consequently estimating camera poses for each frame would yield the same 3D model.

A keyframe is generated when one of two conditions is fulfilled. Case one is when the median displacement between the currently tracked points and the last keyframe feature points y_{k-1} has reached a certain threshold, default 5 px. The second situation happens when the current number of tracked points from the last keyframe has dropped below a certain threshold, default 30% of the number of initially generated points.
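A sketch of this keyframe decision, assuming the currently tracked points are stored in the same order as the last keyframe's points; the threshold defaults follow the text, everything else is illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>
#include <opencv2/core/core.hpp>

// Decide whether the current frame should become a new keyframe, using the two
// criteria described above.
bool isNewKeyframe(const std::vector<cv::Point2f>& lastKeyframePts,  // y_{k-1}
                   const std::vector<cv::Point2f>& currentPts,       // their current positions
                   size_t numInitialPts,
                   double medianDispThresh = 5.0,    // pixels
                   double minTrackRatio    = 0.3)    // 30% of initially generated points
{
    // Criterion 2: too few of the original points are still tracked.
    if (currentPts.size() < minTrackRatio * numInitialPts)
        return true;

    // Criterion 1: median displacement since the last keyframe exceeds the threshold.
    std::vector<double> disp;
    for (size_t i = 0; i < currentPts.size(); ++i) {
        double dx = currentPts[i].x - lastKeyframePts[i].x;
        double dy = currentPts[i].y - lastKeyframePts[i].y;
        disp.push_back(std::sqrt(dx * dx + dy * dy));
    }
    std::nth_element(disp.begin(), disp.begin() + disp.size() / 2, disp.end());
    return disp[disp.size() / 2] > medianDispThresh;
}
```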

3.3 OpenMVG SfM module

Rather than implementing the methods described in sections 2.6.1 to 2.6.3, ready-made SfM modules can be used. In this implementation the system uses the OpenMVG SfM pipeline, which differs somewhat from the theory described in section 2.6. The RANSAC usage in two-view reconstruction and PnP is replaced with AC-RANSAC, explained in section 2.5.2. The inputs to the SfM module are the tracked points produced by the tracker together with the corresponding images, described above in section 3.2.

With known correspondences and image pairs, the system first solves the epipolar two-view problem of finding E to extract pose estimations. The AC-RANSAC scheme runs for a maximum of 4096 iterations in order to find a suitable model, i.e. the model which has the lowest NFA. The number of iterations increases the chance of finding an M consisting of only inliers, but at a certain point the increased chance of finding a better model is negligible compared to the increased computation time required. OpenMVG uses the a contrario principle on the correspondence problem but also on the pose estimation. This again applies to AC-RANSAC, where the threshold for an inlier is computed for each view. This threshold is furthermore used as a confidence and in outlier rejection of newly triangulated tracks. Any triangulated point which yields a larger reprojection error than this threshold is discarded [13].

The estimated camera poses and triangulated points are then written from the C++ environment to text files. These are then used further along the system pipeline, in a Matlab environment. More exactly, the SfM data is used to generate and calculate the volume, as described in the next section, 3.4.

3.4 Volumetric Representation - Space Carving

The Space Carving, according to the theory described in section 2.7, is implemented in Matlab. The first step is to find suitable limits in each direction for the voxel block in order to initiate it. These boundaries are simply found by a relation to the camera centres and their camera frustums, or viewing directions. With these boundaries known, a 3 million voxel block is initiated, where the number of voxels determines the resolution and consequently the computational load.


3.4.1 Object Silhouette and Projection

The silhouette is then found as a binary mask for every view. Since the target family of objects is in the grey color spectrum, this is done by a simple color segmentation. A pixel is deemed to represent the object, and set to 1 in the mask, if it has RGB values between 30 and 120 in each channel, where the maximum value is 255.

At this stage the mask often includes pixels outside of the object. In order to remove such spurious regions, any remaining connected components smaller than 15% of the image area are removed. The resulting binary mask is then dilated to fill any holes that might be present before it is finally used in the projection.
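The thesis does this step in Matlab; the sketch below is an assumed OpenCV analogue of the same three operations (grey-range thresholding, small-component removal, dilation), with the structuring-element size as an illustrative choice.

```cpp
#include <opencv2/opencv.hpp>

// Build the binary silhouette mask for one view.
cv::Mat pileSilhouette(const cv::Mat& imageBGR)
{
    cv::Mat mask;
    // Pixels with all three channels in [30, 120] are labelled as pile.
    cv::inRange(imageBGR, cv::Scalar(30, 30, 30), cv::Scalar(120, 120, 120), mask);

    // Remove connected components smaller than 15% of the image area.
    cv::Mat labels, stats, centroids;
    int n = cv::connectedComponentsWithStats(mask, labels, stats, centroids);
    double minArea = 0.15 * mask.rows * mask.cols;
    cv::Mat cleaned = cv::Mat::zeros(mask.size(), CV_8U);
    for (int i = 1; i < n; ++i)                               // label 0 is the background
        if (stats.at<int>(i, cv::CC_STAT_AREA) >= minArea)
            cleaned.setTo(255, labels == i);

    // Dilate to close small holes before the mask is used for carving.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(15, 15));
    cv::dilate(cleaned, cleaned, kernel);
    return cleaned;
}
```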

3.4.2 Removing Voxels Below the Groundplane

When the object has been formed, the ground plane is removed. In this implementation this is done by finding the ground plane manually from the existing 3D points, which are produced in the OpenMVG incremental SfM pipeline. First, two vectors which best correspond to the ground plane are chosen manually in the 3D point cloud. The cross product of these is then the normal of the ground plane, and the equation of the plane can easily be found. The normalised Hessian normal form is then used to determine whether a point in the voxel block is above or below the ground plane. Any point below the ground plane is removed.
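A plain sketch of this test under the stated assumptions; how the "above" side is oriented is not spelled out in the text, so a reference point assumed to lie above the plane (e.g. the centroid of the pile points) is used here to fix the sign.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

static Vec3 cross(const Vec3& a, const Vec3& b)
{
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Remove voxels below a ground plane defined by a point on the plane and two manually
// chosen in-plane vectors. "Below" means on the opposite side of the plane from refAbove.
std::vector<Vec3> removeBelowGround(const std::vector<Vec3>& voxels,
                                    const Vec3& planePoint,
                                    const Vec3& v1, const Vec3& v2,
                                    const Vec3& refAbove)
{
    Vec3 n = cross(v1, v2);                                     // plane normal
    double len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    n = {n.x / len, n.y / len, n.z / len};                      // normalised (Hessian form)
    double d = -(n.x * planePoint.x + n.y * planePoint.y + n.z * planePoint.z);

    auto signedDist = [&](const Vec3& p) {
        return n.x * p.x + n.y * p.y + n.z * p.z + d;
    };

    // Orient the normal so that the reference point has positive signed distance.
    double sign = signedDist(refAbove) >= 0.0 ? 1.0 : -1.0;

    std::vector<Vec3> kept;
    for (const Vec3& p : voxels)
        if (sign * signedDist(p) >= 0.0)
            kept.push_back(p);
    return kept;
}
```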

3.5 Determining the Real Volume

After the 3D points have been generated and filtered, a representative 3D model of the object is left. The estimated volume of this model, given in CCS, must then be related to WCS. The volume in CCS is found after projections for all views have been made and the ground plane has been removed. The volume of the object is simply the number of occupied voxels that are left, multiplied by the volume per voxel. With the known span of the initial voxel block on each axis and the known number of initial voxels, the volume of each voxel is:

V_{vol} = \frac{1}{N_{init}} ||x_{axis}|| \, ||y_{axis}|| \, ||z_{axis}||,    (3.1)

where V_{vol} is the volume of each voxel given in CCS, N_{init} is the initial number of voxels, and ||x_{axis}|| is the length of the initial voxel block along the x-axis (analogously for the y- and z-axes).

The next step is to move from the CCS to the WCS. This is done by using the scale factor described in section 2.8.3. Matlab was used with the data generated by OpenMVG and the IMU data. In an attempt to suppress the impact of bias and noise in \vec{a}_{world}(t), the scaling factor s_{C2W} is only calculated over the first 100 poses.

The final estimated volume is consequently an estimate in real world units. The estimated volumes of the GT data sets, together with results from each module, are presented in the next chapter.

4 Evaluation and Results

Using the method described in the previous chapter, the system results are generated. In this chapter the results are presented and evaluated.

With the goal of having a simple experiment setup, the user only has to collect data and then run it through the system. The system inputs are the sensor data and a video sequence, and the final output to the user is the volume of the object. The results presented in this chapter mainly consist of two sub-results: the IMU processing, i.e. the scale estimation, and the computer vision module with the final volume estimation.

The chapter starts with the IMU and video data collection and then shows the subsystem results. The program is written in C++ and Matlab and executed on a PC running Windows 10, with a Linux Virtual Machine in order to run OpenMVG.

Each module has been tested separately, and the data and corresponding results are presented below.

4.1 Data Sets

The full system is only evaluated on two stone and gravel piles with known ground truth volume. These sets of GT data have been acquired with the company which has the main responsibility for transport to and from quarries around Linköping. Two different types of stone and gravel were recorded, and each pile was recorded 2 times. The piles were then lifted onto a truck and weighed with an accuracy of ±50 kg, as shown in table 4.1 below. The density of the stone piles was given by GDL transport and has an inaccuracy below 5%. The real volume is then calculated directly from the weight and density and presented as GT Volume in the same table.

Images from the recordings can be seen below. In figure 4.1 the first pile can be seen, recorded while moving clockwise. Each image is roughly 1 second, or 60 frames, apart. In figure 4.2 pile 2 is shown, also recorded clockwise. Both scenes have also been recorded counterclockwise. Keyframes are the number of frames used for SfM, whilst all frames are used in the tracking module; how these keyframes are selected is described in section 3.2.

Pile-Scene   Weight          Density           GT Volume       Frames   Keyframes
1-1          38900 ± 50 kg   1700 ± 85 kg/m3   22.9 ± 1.1 m3   2545     1182
1-2          38900 ± 50 kg   1700 ± 85 kg/m3   22.9 ± 1.1 m3   2366     1133
2-1          20100 ± 50 kg   1700 ± 85 kg/m3   11.8 ± 0.6 m3   1879     817
2-2          20100 ± 50 kg   1700 ± 85 kg/m3   11.8 ± 0.6 m3   2037     923

Table 4.1: First column denotes the pile and which scene, second column is the weight in kg, third the density in kg/m3, fourth the GT volume (weight divided by density), fifth the number of frames in the entire video stream of that scene and sixth the number of keyframes.

4.2 Sensor Fusion - Scale Estimation

The first major result is the estimated scale, which is compared to the reference scale from the ruler, described in section 2.8.3. Below are the parts which are used to compute the scale.

4.2.1 IMU Readings and Integration

The theory described in section 2.2 and the implemented method regarding the IMU have been tested with simple recordings, i.e. by moving the cellphone in a rectangular shape and walking back and forth. These results are not shown; instead the results on the GT scenes can be seen below. In figure 4.3 the acceleration data is shown for scene 1-2.

To make this data useful, the methods described in section 3.1 are used to transform it to WCS and then remove the gravity. The corresponding result can be seen in the upper right corner of figure 4.3. To retrieve positions, the accelerometer data is double integrated into a position estimate of the cellphone using the methods described in section 3.5.

The intermediate integration, i.e. the velocity, is shown in the lower left corner, and the final estimation of 2D positions in the x-z plane, given in metres, is shown in the lower right corner of figure 4.3. Similarly, the same responses from scene 2-1 can be seen below in figure 4.4 and for scene 2-2 in figure 4.5. In the ideal case these positions should form a circular shape, or at least an elliptic shape, since the user ends the recording at the initial position.

4.2.2 Result - Scale Factor

The reference ruler of 1 m is used to determine s^{GT}_{C2W} for each scene. Since the user is at a different distance from it, and the 3D model is different, s^{GT}_{C2W} is different for each scene. The resulting scale for the different scenes is presented in table 4.2. Using equation (2.21), the scale estimated from the IMU data is presented as s_{C2W}; the quality of these estimates is directly dependent on the IMU data.

Figure 4.1: Scene 1 collage with images from the corresponding video sequence

Scene   Keyframes   s^{GT}_{C2W}   s_{C2W}
1-1     719         2.35 m         1.17 m
1-2     1133        1.35 m         1.10 m
2-1     817         1.41 m         3.98 m
2-2     923         2.25 m         1.27 m

Table 4.2: Results of the scale estimation for each scene. First column denotes the scene, second column the number of keyframes used in the SfM 3D model generation, third the GT scale and fourth the estimated scale.

Scale estimation when using all poses

The estimated s_{C2W} described above is computed using only the first 100 poses. In order to show the influence of using all poses for each corresponding scene, these results have also been generated and are shown below. Estimating the scale with this parameter change leads to a large impact of the bias and noise present in the IMU; see table 4.3.


Figure 4.2: Scene 2 collage with images from corresponding video sequence

Scene   s_{C2W}
1-1     60.26 m
1-2     2.67 m
2-1     48.27 m
2-2     17.52 m

Table 4.3: Results with all poses used for the metric scale calculation. First column denotes the scene and second column the estimated scale.

4.3 Computer Vision - Volume Estimation

Here, the results from the second major subsystem are shown. The corresponding modules which produce the 3D model and volume estimation are also evaluated.

4.3.1 Tracking module

The tracking module has been tested on several piles besides the GT data. Such tests have been performed by recording an object of interest, running the video sequence through the tracker and then evaluating the length of the point tracks. The results of this module are putative corresponding point tracks and a set of keyframes.

A typical tracking sequence is shown in figure 4.6, generated from scene 1. The current position of each tracked point (red dots) and its tracked path (blue


Figure 4.3: Accelerometer data from scene 1-2, x-axis in blue, y-axis in brown and z-axis in yellow. Upper left: raw acceleration data. Upper right: WCS acceleration. Lower left: velocity. Lower right: position in 2D; s_C2W is estimated from the green dot to the black dot.

tracks) are shown. These tracks describe how each point has been displaced over time. The lengths of these vectors are used to determine the median displacement criterion described in section 3.2. If the criterion is fulfilled, new points are generated and added to the tracker.
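As a rough illustration of this tracking loop, the sketch below uses OpenCV's pyramidal KLT tracker together with a median-displacement check. The threshold value and function names are placeholders and do not reproduce the exact criterion of section 3.2; the sketch also assumes that some tracks always survive between frames.

import cv2
import numpy as np

def track_video(frames, median_thresh=20.0):
    """Track KLT points through a frame sequence and flag a keyframe when the
    median point displacement since the last keyframe exceeds a threshold."""
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=1000, qualityLevel=0.01, minDistance=7)
    ref_pts = pts.copy()
    keyframes = [0]
    for i in range(1, len(frames)):
        gray = cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        ok = status.ravel() == 1
        pts, ref_pts = nxt[ok], ref_pts[ok]
        # Median displacement of the surviving tracks since the last keyframe.
        disp = np.linalg.norm((pts - ref_pts).reshape(-1, 2), axis=1)
        if len(disp) and np.median(disp) > median_thresh:
            keyframes.append(i)
            # Re-detect points so the tracker stays well populated.
            pts = cv2.goodFeaturesToTrack(gray, maxCorners=1000, qualityLevel=0.01, minDistance=7)
            ref_pts = pts.copy()
        prev = gray
    return keyframes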

4.3.2 Structure-from-Motion

The keyframes and corresponding point tracks, explained above in section 4.3.1, are used in OpenMVG's SfM pipeline. The outputs of this pipeline are a 3D point cloud and estimates of the camera poses. These camera poses are visualised for the GT data sets later in this section. The module results, 3D points and camera poses, have also been evaluated and verified using the results explained above, on both test and GT data.

The estimated camera poses are then used in the Space Carving module to generate a solid volume. The typical SfM results generated can be seen below, with one image per scene. In figure 4.7 the result from data set 1-2 is shown from the side. In these images the 3D points are shown in white and the poses in green. In figure 4.8 the same scene is shown but in a top-view perspective. Similarly, the result for pile 2-2 is shown in figure 4.9 from the side and in figure 4.10 in a top-view perspective.


Figure 4.4: Accelerometer data from scene 2-1, x-axis in blue, y-axis in brown and z-axis in yellow. Upper left: raw acceleration data. Upper right: WCS acceleration. Lower left: velocity. Lower right: position in 2D; s_C2W is estimated from the green dot to the black dot.

4.3.3 Space Carving

The outputs from the SfM module, in the form of camera poses, 3D points and images, are then used to form a solid object. This is done using the Space Carving method explained in section 3.4.

Initial Voxel Block

First the scene is initiated with a voxel block inside the camera frustums, see figure 4.11. In these images only a tenth of all cameras are visualised in blue, with their viewing directions shown as well. This module has only been used on the GT data, generating a total of four results. Using the silhouette for each camera pose, such as the one in figure 4.12, the voxel block is carved image by image.
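The image-by-image carving can be sketched as follows: each remaining voxel centre is projected into a camera and voxels that land outside the silhouette mask are removed. This is a minimal illustration assuming a simple pinhole projection with precomputed 3x4 camera matrices; it is not the thesis implementation and all names are illustrative.

import numpy as np

def carve(voxels, occupied, cameras, silhouettes):
    """voxels: (N, 3) voxel centres, occupied: (N,) boolean mask,
    cameras: list of 3x4 projection matrices, silhouettes: list of binary masks."""
    voxels_h = np.hstack([voxels, np.ones((len(voxels), 1))])  # homogeneous coordinates
    for P, mask in zip(cameras, silhouettes):
        proj = voxels_h @ P.T                      # project all voxel centres
        u = (proj[:, 0] / proj[:, 2]).astype(int)  # pixel column
        v = (proj[:, 1] / proj[:, 2]).astype(int)  # pixel row
        h, w = mask.shape
        inside_img = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (proj[:, 2] > 0)
        in_silhouette = np.zeros(len(voxels), dtype=bool)
        in_silhouette[inside_img] = mask[v[inside_img], u[inside_img]] > 0
        # A voxel visible in this view survives only if it projects inside the
        # silhouette; voxels projecting outside the image are left untouched here.
        occupied &= in_silhouette | ~inside_img
    return occupied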

Ground Plane Segmented From Voxel Block

Below is the result when the ground plane is found, see figure 4.13, and removed from the voxel block, see figure 4.14.

Final Voxel Block

The object volume is represented by, and calculated from, the voxels remaining after the Space Carving. Two typical results are shown in figures 4.15 and 4.16.


Figure 4.5: Accelerometer data from scene 2-2, x-axis in blue, y-axis in brown and z-axis in yellow. Upper left: raw acceleration data. Upper right: WCS acceleration. Lower left: velocity. Lower right: position in 2D; s_C2W is estimated from the green dot to the black dot.

4.3.4 Result - Volume Estimation

Finally, the volume estimation, based on the modules described above, can be generated. The relation between the estimated volume in the CCS and the real-world volume is given by the scale s_C2W presented above in section 4.2.2. In table 4.4 the results from the system on the GT data are shown. The camera-to-world scale is labelled s_C2W. Furthermore, the volumes in the CCS after applying the Space Carving methodology explained in section 3.4 are listed for each scene. Those values are calculated from the space-carved scenes, e.g. from scene 1-2 visualised in figure 4.15. The pose data, shown in e.g. figure 4.9, where green dots are the estimated camera positions, are used for the Space Carving projections. The orientation of the cameras can more easily be seen in e.g. figure 4.15. Finally, this result is combined with the scaling factor and used to determine the WCS volume for each scene in cubic meters, labelled V.
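Written out, the relation implied by the columns of table 4.4 is that the metric volume follows from the voxel count, the voxel volume in the CCS and the cube of the scale (a length scale converts volumes with its third power). A worked example for scene 1-1, using the tabulated values, reads:

\[
V = s_{C2W}^{3} \cdot N_{tot} \cdot V_{voxel} \approx 1.60 \cdot 1\,071\,515 \cdot 6.75 \cdot 10^{-6} \approx 11.6\ \mathrm{m}^3 ,
\]

which agrees with the tabulated V of 11.58 m3 up to rounding.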

GT Volume Estimation

To evaluate the effect of s_C2W on the volume estimation separately, a GT scale estimation has been made. The GT scale, s^GT_C2W, is then used instead of s_C2W to estimate the volume. The result is presented in table 4.5.


Figure 4.6: Scene 2-2 tracked points' motion vectors. The green area is zoomed and shown in the right-hand image. In this small area, the red dots have been tracked in a similar fashion. The result is shown by the blue motion vectors, indicating each tracked point's movement between tracked frames.

Figure 4.7: Scene 1-2 point cloud, side view. Green points are the estimated camera centers, white points are 3D points and each red curve represents a Cartesian CCS coordinate axis.


Figure 4.8: Scene 1-2 point cloud, top view. Green points are the estimated camera centers, white points are 3D points and each red curve represents a Cartesian CCS coordinate axis.

Figure 4.9: Scene 2-2 point cloud, side view. Green points are the estimated camera centers, white points are 3D points and each red curve represents a Cartesian CCS coordinate axis.


Figure 4.10: Scene 2-2 point cloud, top view. Green points are the estimated camera centers, white points are 3D points and each red curve represents a Cartesian CCS coordinate axis.

Figure 4.11: Scene 1-2 Space Carving with initiated voxel block in green andcameras in blue.

Figure 4.12: Scene 2-1 Space Carving silhouette segmentation.


Figure 4.13: Scene 1-2 scene with 3D points and the estimated ground plane. The ground plane is the straight line below the green pile, and the blue tracks around the pile are the camera centers.

Figure 4.14: Scene 1-2 Space Carving with the ground plane removed.


Figure 4.15: Scene 1-2 space carved with all views. Green is the space carving result and blue the cameras.

Figure 4.16: Scene 2-1 space carved with all views. Green is the space carving result and blue the cameras.


Scene   s_C2W^3 (m3)   Kf     N_tot      V_voxel          GT Volume (m3)   V (m3)
1-1     1.60           719    1071515    6.75 · 10^-6     22.88            11.58
1-2     1.33           1133   713564     3.05 · 10^-5     22.88            29.11
2-1     63.04          817    149411     1.38 · 10^-5     11.82            130.05
2-2     2.05           923    1118972    6.74 · 10^-6     11.82            15.40

Table 4.4: The numeric results for each scene. First column is the scene, s_C2W^3 is the cube of the estimated camera-to-world scale, Kf is the number of keyframes used, N_tot the number of voxels left after Space Carving, V_voxel the volume of each voxel, GT Volume (weight divided by the density) and V the system volume estimate.

Scene   s^GT,3_C2W (m3)   GT Volume (m3)   V^GT (m3)
1-1     12.98             22.88            93.83
1-2     2.46              22.88            53.82
2-1     2.80              11.82            5.79
2-2     11.39             11.82            85.66

Table 4.5: The numeric results for each scene using s^GT_C2W. First column is the scene, s^GT,3_C2W is the cube of the estimated GT camera-to-world scale, GT Volume (weight divided by the density) and V^GT is the system volume estimate using s^GT,3_C2W.

5 Discussion

In this chapter, the results and the implementation of the thesis work are discussed.

5.1 Result

The quality of the final result, the estimated volume, varies greatly between scenes. This can be explained by two flaws which both affect the outcome directly: the estimation of s_C2W and the Space Carving.

This is based on the fact that the results from the tracker are sufficient for the goal of acquiring well-estimated poses. Consider e.g. figure 4.6, where the blue motion vectors indicate each tracked point's movement between frames. Examining a small area shows that the points have been tracked in a similar fashion. Furthermore, the 3D models, which contain camera poses and point clouds, have been reconstructed very well. They are similar to the real structure, see e.g. figures 4.7 to 4.10, and the camera poses correspond to the user's movement around the objects of interest.

To draw any exact conclusions about the overall system performance, a larger amount of GT data should be added. Preferably the GT data should be located without similar objects in the background, but no such areas with isolated GT volumes have been found. In an attempt to generate such GT data, gravel piles in proximity to construction sites were filmed, but their exact volumes were not known to the personnel on site. The surrounding areas also had similar texture, e.g. trees and construction material, so this approach was not pursued. Currently the four scenes show a large variation in the final results, which makes it difficult to make any statement about the system performance.



5.1.1 Scale Estimation

The overestimated scale in scene 2-1, see table 4.2, is due to the large false drifts at the beginning of the accelerometer recordings, which can be seen in figure 4.4. The scale has been calculated from the starting point, marked green, to the 100th pose, at the black cross. The user has not moved such a distance, about 130 m, during that window of time. The origin of this problem is the bias offset and noise present in the sensor, which are not accounted for in this implementation. This results in a roughly linear error which accumulates over time and leads to position estimation drift, which can be noted in figure 4.3 but is especially apparent in figure 4.5. This effect can also be noted in table 4.3, where using all IMU data leads to large over-estimations in scenes 2-1 and 1-1.

According to the reference scale in table 4.2, the scale in 1-2 should be about twice as large. Consequently, even when using only 100 poses to estimate s_C2W, this system module has low robustness.

5.1.2 Space Carving

The second major problem is caused by inadequate silhouette segmentation. The ground falls into the grey color spectrum and is therefore, in some views, also segmented as part of the object silhouette. This also applies to parts of the background, where the massive piles are registered as part of the actual silhouette.

The resulting model from scene 2-1, figure 4.16, instead has the problem of lost voxels in the object. This is a result of holes in the binary mask, which project lines through the model and remove voxels. This happens even though the binary mask has been dilated in an attempt to remove such cavities. To overcome this issue, the point correspondences from the SfM module could be used to generate more accurate Space Carving; basically, the 2D points should be used to restrict the mask to the object itself. Another approach would be to only remove voxels which are not seen in multiple views.

5.1.3 Volume Estimation

The volume estimates in table 4.4 suggest that the Space Carving module does not produce stable results, which is apparent in scene 2-1, see figure 4.16, where a pile cannot even be seen. On the other hand, a large chunk of ground is left in all scenes, falsely increasing the estimated volume. This suggests that the ground plane segmentation, again a part of Space Carving, is not robust enough. The result V^GT using s^GT,3_C2W furthermore shows that even with a GT scale the volume is overestimated, probably caused by the ground segment.

5.2 Method

The tracking method is commonly used and is working as intended for these scenes. This can be seen in the resulting 3D models in figures 4.7 to 4.10.


A negative aspect of the implementation is that the system is separated into three modules, as explained in section 3. This means that the data is parsed through each module instead of being passed continuously, which would be more efficient, faster and less resource demanding. It would also allow all the SfM data to be used in the Space Carving module, instead of the current version where only the 3D points and cameras are passed to the Matlab module.

To improve the IMU data, a simple bias estimation was made at the start of the work by simply reading the sensors while the telephone was lying still. However, this bias estimate is not practically usable, since its magnitude and direction differ with each start-up of the IMU and can also change over time due to physical conditions such as temperature and mechanical stress. Overcoming this problem is a tough task but would result in a better end product.
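Such a stationary bias estimate amounts to averaging the accelerometer readings over a window where the phone is known to be still. The sketch below is illustrative only; it assumes the readings are already expressed in the world frame so that gravity can be removed along the vertical axis, and the names are hypothetical.

import numpy as np

def estimate_bias(acc_world_still, g=9.81):
    """Average world-frame accelerations over a stationary window. Any non-zero
    mean (after removing gravity along z) is taken as the accelerometer bias."""
    bias = acc_world_still.mean(axis=0)
    bias[2] -= g  # remove gravity from the vertical axis
    return bias

# Hypothetical usage: subtract the bias before integrating.
# acc_corrected = acc_world - estimate_bias(acc_world[:200])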

A better estimate of s_C2W could also be determined such that it minimises the distances between the real-world positions and the estimated camera poses. An error cost function for such an optimisation could look like:

\varepsilon_{s_{C2W}} = \sum_{k=1}^{N} \left\| s_{C2W}\, t_{k}^{C} - t_{k}^{W} \right\|    (5.1)

where s_C2W is chosen such that it minimises the error \varepsilon_{s_{C2W}} over all N camera poses, t_k^C is the translation between keyframes given by the SfM camera poses (in the CCS) and t_k^W is the corresponding real-world translation, generated by e.g. the IMU data as presented in this thesis.
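If the norm in equation (5.1) is replaced by its square, the optimisation admits a closed-form solution, which a sketch could use as a starting point. This squared-error variant is an assumption on top of the cost above, not the formulation in equation (5.1) itself:

import numpy as np

def scale_least_squares(t_c, t_w):
    """Closed-form scale for the squared-error variant of equation (5.1):
    minimise sum_k || s * t_c[k] - t_w[k] ||^2 over s.
    t_c: (N, 3) SfM (CCS) translations, t_w: (N, 3) metric (IMU) translations."""
    # Setting the derivative with respect to s to zero gives
    # s = sum_k t_c[k] . t_w[k] / sum_k ||t_c[k]||^2.
    return np.sum(t_c * t_w) / np.sum(t_c * t_c)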

5.3 Future Work

Besides verifying the results extensively by running the system on more data, a bias estimation should be implemented in order to effectively remove the large accumulated errors in the accelerometer readings. This would improve the accuracy of the scale factor estimation and, effectively, also the volume estimation.

The software could also be developed to process other objects than those presented and used in this thesis. Using the system on an object without the same amount of texture as stone and gravel piles would involve implementing a new tracker. The current tracker could be complemented with e.g. a line detector or a SIFT tracker, and the tracker best suited to the target object would then be used.

The Space Carving would also need modifications in order to find the silhouette of the target object in the images. Ideally the silhouette segmentation should utilise the points generated from SfM. Such an implementation would improve the Space Carving shown in figure 4.15.

5.3.1 Rolling shutter calibration

Due to the cellphone's rolling shutter camera, the read-out times of the pixels in the image plane differ from one another. This means that, since the camera is moved during capture, some geometric distortion is induced in


the acquired images [20]. The smartphone cameras also use wide lenses, inducing another type of distortion. The rolling shutter distortion can be compensated for by using the IMU data [17], by creating a model of the distortion for each image and compensating accordingly. Similarly, the wide-lens distortion can be compensated for a given camera, but neither of these compensations would change the system performance radically. Consequently, these distortions have been neglected, since the tracking performance is robust.

5.4 Wider Implications of the Project Work

By addressing the current issues with Space Carving and scale estimation, the system should produce more stable results. With such improvements, the system has the potential to be practically useful if the subsystems are integrated into a single application. A product which only gathers data and then returns the volume estimate could have the potential to replace the laser camera systems and drones which are commonly used to determine volumes at e.g. quarries.

This would remove most of the expensive equipment costs and the costs for the hours needed for training and use of said equipment. The replacement technology is intuitive, cheap and easy to use; the largest cost would then be for acquiring and using such software. An ethical issue also arises with automatic technology such as the one described above. The past couple of years have seen an increased number of robots and similar technologies used in everyday job situations, replacing humans. For the most part this is positive for society and individuals in the bigger picture, but it also means that there are fewer practical jobs.

6 Conclusions

The goals of the thesis, described in section 1.3, have been answered to an extent with this approach, as elaborated below.

6.1 Volume Estimation

Is it possible to compute the volume of the target objects semi-automatically using a cellphone and a PC? The thesis shows that this task only requires a video stream and the user's motion pattern, recorded by the cellphone's IMU. Possibly, the cellphone could send the data to a PC and retrieve the volume estimate through an app or similar, but this has not yet been implemented.

6.2 Method Evaluation

The implementation of the system is made in different subsystems, where each module has been run and evaluated separately. The tracker has been evaluated by visually examining the point feature tracks. At the same time the cellphone camera has been validated, since the tracker has robust performance on the video data.

OpenMVG's SfM module has been used, and its developers have validated the performance on different data sets. It has simply been used with the tracker results, and the pose and 3D data have been robust for the GT scenes.

The Space Carving has been validated while it was developed. The finished module is also validated by visual inspection of the remaining 3D voxel block. However, the performance of this module varies greatly and needs to be improved.



Sensor fusion was validated on simple motions, in rectangular patterns and along each axis, by generating the position from such IMU data.

6.3 System Performance

The results seen in table 4.4 are not practically usable due to the instability of both the scale estimation and the Space Carving results. The main change would be to use a more advanced Space Carving. One improvement would be to remove voxels based on counting the number of views in which a voxel is not included in the segmentation; if this count reaches a specific threshold, the voxel would then be removed, instead of being removed directly as in the current implementation, as sketched below. On top of this, a better silhouette segmentation must be made. The system could e.g. use machine learning to improve stone and gravel pile segmentation. Furthermore, the Space Carving could utilise e.g. an octree to create the voxel block with higher resolution.
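Such a voting variant could count, per voxel, the number of views in which it falls outside the silhouette and remove it only above a threshold. The short sketch below assumes that the per-view outside/inside flags have already been computed with the same projection step as ordinary Space Carving; the threshold value is a free parameter, not something determined in this thesis.

import numpy as np

def carve_by_voting(outside_per_view, miss_threshold=3):
    """outside_per_view: (num_views, N) boolean array where entry [i, j] is True
    if voxel j projects outside the silhouette in view i. A voxel is removed
    only when it is missed in at least miss_threshold views."""
    misses = outside_per_view.sum(axis=0)  # silhouette misses per voxel
    return misses < miss_threshold         # True = voxel kept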

Another improvement would be to use more advanced sensor fusion and an accurate estimate of the accelerometer bias. With such an implementation the original idea could be realised, where the poses would be generated with the IMU and optimised against the ones generated from the SfM. The idea is that a certain scale will minimise the error between the rotations and translations for all views, resulting in a robust and accurate scale estimation.

With further testing and improvements the system has the potential to be useful. The client Escenda is interested in a fully developed version of the system. Such a product could have the potential to replace the laser camera systems and drones which are commonly used to determine volumes at quarries.

Bibliography

[1] Avinash Nehemiah, MathWorks. Camera calibration with MATLAB. https://se.mathworks.com/videos/camera-calibration-with-matlab-81233.html, Used 24-03-2016. Cited on page 12.

[2] G. Bradski. OpenCV 2.4.12 C++ library. Dr. Dobb's Journal of Software Tools, 2000. Cited on page 23.

[3] Hainan Cui, Shuhan Shen, Wei Gao, and Zhanyi Hu. Efficient large-scale structure from motion by fusing auxiliary imaging information. Image Processing, IEEE Transactions on, 24(11):3561–3573, 2015. Cited on page 18.

[4] Michela Farenzena, Andrea Fusiello, and Riccardo Gherardi. Structure-and-motion pipeline on a hierarchical cluster tree. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 1489–1496. IEEE, 2009. Cited on page 20.

[5] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981. Cited on page 14.

[6] Andrew W Fitzgibbon, Geoff Cross, and Andrew Zisserman. Automatic 3D model construction for turn-table sequences. In 3D Structure from Multiple Images of Large-Scale Environments, pages 155–170. Springer, 1998. Cited on page 17.

[7] Fredrik Gustafsson. Sensor fusion. https://play.google.com/store/apps/details?id=com.hiq.sensor&hl=sv, Used 18-01-2016. Cited on page 23.

[8] Google. Google Camera. https://en.wikipedia.org/wiki/Google_Camera, Used 02-10-2017. Cited on page 23.

[9] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003. Cited on pages 13 and 14.


[10] Jonathan Kelly and Gaurav S Sukhatme. Visual-inertial sensor fusion: Localization, mapping and sensor-to-sensor self-calibration. The International Journal of Robotics Research, 30(1):56–79, 2011. Cited on page 18.

[11] Klas Nordberg. Introduction to representations and estimation in geometry. http://www.cvl.isy.liu.se/research/publications/IREG/0.31/, Used 18-01-2016. Cited on pages 10, 12, 13, 15, and 16.

[12] Pierre Moulon and Bruno Duisit. OpenMVG documentation 1.1. https://media.readthedocs.org/pdf/openmvg/latest/openmvg.pdf, Used 05-09-2017. Cited on page 16.

[13] Pierre Moulon, Pascal Monasse, and Renaud Marlet. Adaptive structure from motion with a contrario model estimation. In Kyoung Mu Lee, Yasuyuki Matsushita, James M. Rehg, and Zhanyi Hu, editors, ACCV 2012, volume 7727 IV of Lecture Notes in Computer Science, pages 257–270, Daejeon, South Korea, November 2012. Springer. doi: 10.1007/978-3-642-37447-0_20. URL https://hal-enpc.archives-ouvertes.fr/hal-00769266. Cited on pages 14 and 24.

[14] Shree K Nayar and Yasuo Nakagawa. Shape from focus. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 16(8):824–831, 1994. Cited on page 9.

[15] Eduardo Nebot and H Durrant-Whyte. Initial calibration and alignment of an inertial navigation. In Mechatronics and Machine Vision in Practice, 1997. Proceedings., Fourth Annual Conference on, pages 175–180. IEEE, 1997. Cited on page 11.

[16] M Omid, M Khojastehnazhand, and A Tabatabaeefar. Estimating volume and mass of citrus fruits by image processing technique. Journal of Food Engineering, 100(2):315–321, 2010. Cited on page 9.

[17] Hannes Ovrén and Per-Erik Forssén. Gyroscope-based video stabilisation with auto-calibration. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 2090–2097. IEEE, 2015. Cited on pages 9 and 44.

[18] Pierre Moulon, Pascal Monasse, Renaud Marlet, and others. OpenMVG: an open multiple view geometry library. https://github.com/openMVG/openMVG, Used 24-02-2016. Cited on pages 13 and 21.

[19] Manika Puri, Zhiwei Zhu, Qian Yu, Ajay Divakaran, and Harpreet Sawhney. Recognition and volume estimation of food intake using a mobile device. In Applications of Computer Vision (WACV), 2009 Workshop on, pages 1–8. IEEE, 2009. Cited on page 9.

[20] Erik Ringaby and Per-Erik Forssén. Efficient video rectification and stabilisation for cell-phones. International Journal of Computer Vision, 96(3):335–352, 2012. Cited on page 44.


[21] Jianbo Shi et al. Good features to track. In Computer Vision and Pattern Recognition, 1994. Proceedings CVPR'94., 1994 IEEE Computer Society Conference on, pages 593–600. IEEE, 1994. Cited on page 12.

[22] Carlo Tomasi and Takeo Kanade. Detection and tracking of point features. 1991. Cited on page 13.

[23] David Törnqvist. Estimation and detection with applications to navigation. PhD thesis, Linköping University Electronic Press, 2008. Cited on page 10.