
978-1-5386-2485-2/18/$31.00 ©2018 IEEE RADIOELEKTRONIKA 2018

Advanced Point Cloud Estimation based on Multiple View Geometry

Jan Hlubik, Patrik Kamencay, Robert Hudec, Miroslav Benco, Peter Sykora Department of Multimedia and Information-Communication Technologies

University of Zilina Zilina, Slovakia

{patrik.kamencay, jan.hlubik, robert.hudec, miroslav.benco, peter.sykora}@fel.uniza.sk

Abstract—In this paper we present a method based on Multiple View Geometry (MVG) for accurate 3D point cloud estimation. The method identifies features in all images of the dataset, such as edges with gradients in multiple directions, matches these features between the images, and then computes the relative camera motion. It builds a 3D model from the correlated features and creates a 3D point cloud with color information of the scanned object. The designed method provides simple access to the classic problem solvers in Multiple View Geometry (MVG). The experimental results show that the best results were obtained using the combination of the SIFT (Scale-Invariant Feature Transform) descriptor and the ANN (Approximate Nearest Neighbor) matcher (43 matching images and 8942 corresponding points found).

Keywords—point cloud; multiple view geometry; 3D reconstruction; structure from motion; computer vision

I. INTRODUCTION

Geometry and geometric camera calibration are very important topics because they are prerequisites for image-based metrology, and we have strong expertise in this field. This problem area includes the structure from motion problem, which is the simultaneous recovery of a sparse scene and of the camera motion (rotations and translations) from multiple views, and the multi-view stereo problem, which deals with dense image-based surface reconstruction given known camera motion. Besides measurement and modeling of rigid scenes captured with conventional photographic cameras, the research has recently expanded into other areas as well, such as calibration of non-classical cameras, analysis of dynamic and non-rigid scenes, and novel bio-imaging applications [1]. In recent years, new algorithms and new insights into the problems of matching images and computing camera calibration have been gained. However, the goal of automatically obtaining a simplified polygonal model that authentically captures the main features of the scene still seems to be out of reach.

There are currently many problems that need to be taken into account in order to build an accurate 3D point cloud model based on multiple-view geometry, and surface reconstruction itself has many facets. This work focuses on those relating to the reconstruction of static objects from point clouds.

The outline of this paper is as follows. In the second section, the state of the art is discussed. In the next section, structure from motion (SFM) is described. The achieved experimental results are presented in Section IV.

Finally, the last section summarizes the achieved experimental results and describes future work.

II. STATE OF THE ART

State-of-the-art camera calibration and 3D reconstruction systems are based on very sparse point features, such as SIFT, and on projective geometry, which can only model points and lines or simple curves such as circles and other conic sections. These systems suffer from many of the following limitations: sparsity, requirements of a simple scene, controlled acquisition, difficulty with non-planar objects, requirement of strong calibration, abundant texture, short baselines, and lack of geometric consistency. We believe these systems are useful but form only a module within a greater structure from motion system [1], [12].

Most comparisons between different structure from motion algorithms have been made in close-range measurements of small objects, such as ornaments or sculptures [2], [3]. Nikolov et al. [4] tested six different commercial structure from motion software packages (including PhotoScan by Agisoft). They chose objects with varying characteristics, such as featureless surfaces, repetitive patterns and glossiness. The produced models were compared to an undisclosed ground truth reference, and they found that most algorithms provided submillimeter accuracy for close-range photos, and that PhotoScan in particular was consistent across different lighting conditions but failed to resolve finer detail, as it over-smoothed the surfaces [5], [6].

III. STRUCTURE FROM MOTION

Structure from Motion (SFM) is an approach to generating 3D models of objects and structures (see Fig. 1). The test dataset consists of a series of images rendered from the 3D reference model. The ease of data acquisition and the wide array of available algorithms make the technique easily accessible. There are different implementations of the structure from motion method that use different approaches to solve the feature-correlation problem between the images of the dataset, different methods for detecting the features, and different alternatives for sparse and dense reconstruction as well. These differences cause variations in the final output across distinct algorithms. The implementation provides methods to solve the SFM sub-problems (camera pose estimation, structure triangulation, bundle_adjustment) [7].

This work was supported by the Slovak Research and Development Agency under the contract No. APVV-16-0505: The short-term PREDICtion of photovoltaic energy production for needs of pOwer supply of Intelligent BuildiNgs – PREDICON.


Fig. 1. The block diagram of the Structure from Motion (SFM) algorithm: the input images enter the Multiple View Geometry (MVG) stage (image matching, structure from motion), followed by the Multiple View Stereovision (MVS) stage (surface estimation, color restitution).

The problem of Structure from Motion (SFM) is that of estimating the parameters of 3D geometric entities by observing the apparent motion of their projections in several images taken from different viewpoints (see Fig. 2).

OpenMVG provides a generic bundle_adjustment framework to refine, or keep constant, the following parameters (a minimal refinement sketch follows the list):

• internal orientation parameters (intrinsic: camera projection model),

• external orientation parameters (extrinsic: camera poses),

• geometrically valid corresponding features across the images,

• 3D structure (3D points).
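The sketch below illustrates, in a hedged and simplified form, the kind of refinement such a bundle adjustment performs; it is not the OpenMVG API. Camera poses and 3D points are stacked into a single parameter vector and a pinhole reprojection error is minimized with a generic least-squares solver; the fixed intrinsics (f, ppx, ppy) and all variable names are assumptions made for illustration only.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reproject(points_3d, rotvec, tvec, f, ppx, ppy):
    # Apply the camera rotation (Rodrigues vector) and translation, then the
    # pinhole intrinsics [f, 0, ppx; 0, f, ppy; 0, 0, 1] to get pixel coordinates.
    cam = Rotation.from_rotvec(rotvec).apply(points_3d) + tvec
    return np.column_stack((f * cam[:, 0] / cam[:, 2] + ppx,
                            f * cam[:, 1] / cam[:, 2] + ppy))

def residuals(params, n_cams, n_pts, cam_idx, pt_idx, observed, f, ppx, ppy):
    # params = [6 values per camera (rotation vector + translation), then the 3D points].
    poses = params[:6 * n_cams].reshape(n_cams, 6)
    points = params[6 * n_cams:].reshape(n_pts, 3)
    proj = np.vstack([reproject(points[p:p + 1], poses[c, :3], poses[c, 3:], f, ppx, ppy)
                      for c, p in zip(cam_idx, pt_idx)])
    return (proj - observed).ravel()

# result = least_squares(residuals, x0, args=(n_cams, n_pts, cam_idx, pt_idx, obs, f, ppx, ppy))
# result.x then holds the refined poses and 3D structure in the same layout as x0.

Production bundle adjusters (e.g. Ceres inside OpenMVG) additionally exploit the sparsity of the Jacobian and use robust loss functions.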

Fig. 2. Camera localization: a) spherical camera localization, b) estimated camera locations and structure (green points represent cameras).

Firstly, the tracks (see Fig. 3) are computed from point correspondences (red points). The pairwise correspondences are managed by the viewSet function, and using this function it is possible to find the tracks (see Fig. 3).

Below is an overview of the specific code libraries used [11] (a small matching sketch follows the list):

• The SIFT (Scale-Invariant Feature Transform) library finds matching points between two input images (red lines in Fig. 3).

• The camera coordinates are computed using the geometry library, based on intersections of the matching points.

• The sparse point cloud is obtained using the multi-view library, based on triangulation.
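A small, hedged OpenCV sketch of this matching step (the paper itself uses the OpenMVG implementation; image paths are placeholders): SIFT keypoints are detected in two views and matched with an approximate nearest-neighbor (FLANN) matcher, keeping only matches that pass the 0.8 distance-ratio test also used in Section IV.

import cv2

img1 = cv2.imread("view_07.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("view_08.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# FLANN kd-tree matcher = an approximate nearest-neighbor (ANN) search over descriptors.
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
knn = flann.knnMatch(des1, des2, k=2)

# Lowe ratio test: keep a match only if it is clearly better than the second-best one.
good = [p[0] for p in knn if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]
print(len(good), "putative correspondences between the two views")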

Fig. 3. Example of the point correspondences (tracks and connections) across multiple images.

For more information about these libraries, see the OpenMVG library documentation [11].

IV. EXPERIMENTAL RESULTS

For the comparison of point clouds we used a high-resolution 3D scan (Free Skin Shader and Model) [8], which contains 540,482 vertices and a 10k x 10k color map (see Fig. 4 and Fig. 5). This 3D scene was created by Andrei Cristea [8]. We placed the cameras at the vertices of a UV sphere with 12 segments and 5 rings (60 views), as shown in Fig. 4. Only the color texture of the model was used.

Specification of the dataset [8]:

• The 3D scan was cleaned using ZBrush and exported to a 3D model (file format *.OBJ). Next, this 3D model was imported into Blender and the 3D scene for the final render was created.

• The high polygonal model consists of 8 million polygons.

• Reflection, specular and roughness maps were used.

• The resolution of the color map is 10,000 x 10,000 pixels.

Fig. 4. Illustration of the camera positions for object acquisition (12 cameras) [9].


Normal map, bump map and lights were disabled. The origin of the model was set to the Center of Mass (Volume), which was targeted by the cameras. The camera preset was set to Canon APS-C with a sensor dimension of 22.30 x 14.90 mm and a focal length of 50.00 mm. The rendered JPEG images have a resolution of 4752 x 3168 pixels, RGB, 16-bit color depth.

Fig. 5. Test example of the image database [8].

The algorithm for camera motion estimation is divided into these basic steps:

• Estimation of the relative pose of the current view (camera orientation and location).

• Finding the corresponding points and connections across all the views.

• Computation of the 3D locations of the corresponding points using the triangulate_multiview function, which is a very important step (see the triangulation sketch after this list).

• Estimation of the 3D points using the bundle_adjustment function.
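As a hedged, simplified analogue of the triangulation step (the paper relies on the triangulate_multiview function), the two-view case is sketched here with OpenCV, assuming the 3x4 projection matrices of both cameras are already known:

import numpy as np
import cv2

def triangulate_pair(P1, P2, pts1, pts2):
    # P1, P2: 3x4 projection matrices; pts1, pts2: Nx2 matched pixel coordinates.
    X_h = cv2.triangulatePoints(P1, P2, pts1.T.astype(np.float64), pts2.T.astype(np.float64))
    return (X_h[:3] / X_h[3]).T   # convert homogeneous 4xN output to Nx3 Euclidean points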

The position of Cam 7 relative to Cam 8 is found using the SFM method. This approach can be expanded to the multiple-view case by finding the position of Cam 9 and Cam 10, and so on. The relative poses must be transformed into a common coordinate system. This means that all the relative positions of the cameras are determined with respect to Cam 7 (all relative positions are in the same coordinate system) [9], [10], [12]. The OpenMVG SFM pipeline runs as a 4-step process [11]:

A. Image Listing (each image becomes an object called a View)

The first task when processing an image dataset in the openMVG pipelines consists in creating an sfm_data.json file that describes the used image dataset. This structured file lists, for each image, an object called a View. A View stores image information and lists:

• the image name and the image size,

• the internal camera calibration information (intrinsic parameters).

Each view is associated with a camera pose index and an intrinsic camera parameter group. Group means that camera intrinsic parameters can be shared between some Views (which leads to more stable parameter estimation).

In our case, the camera model Pinhole radial 1 (pinhole model + radial distortion) was used. An example of the intrinsic camera matrix is [f, 0, ppx; 0, f, ppy; 0, 0, 1]. The focal length in pixels is computed as follows (if the EXIF camera model and maker are found in the provided sensor width database); these parameters are used to calculate focal_pix in the intrinsic matrix:

focal_pix = max(w_pix, h_pix) · focal / ccdw ,    (1)

where focal_pix is the focal length in pixels, focal is the EXIF focal length in [mm], w_pix and h_pix are the image width and height in pixels, and ccdw is the known sensor width in [mm].
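Equation (1) can be written as a small helper function; the example value below follows from the camera preset used in Section IV (Canon APS-C, 22.30 mm sensor width, 50 mm focal length, 4752 x 3168 px images):

def focal_in_pixels(focal_mm, w_pix, h_pix, ccdw_mm):
    # Eq. (1): scale the EXIF focal length [mm] by the larger image dimension
    # divided by the known sensor width [mm] to obtain the focal length in pixels.
    return max(w_pix, h_pix) * focal_mm / ccdw_mm

# focal_in_pixels(50.0, 4752, 3168, 22.3) -> approximately 10655 pixels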

B. Image Description

The first step of this algorithm is to examine a region of the image centered on a key-point and to calculate a set of characterization values for that region. This involves calculating affine invariants for stretch, skew, photometric intensity changes and rotation effects, using shape-adapted texture descriptors. The output of this algorithm is a feature vector. This step computes the image description for a given sfm_data.json file; for each view it computes the image description (local regions). The most popular algorithms for this purpose used in multiple structure from motion packages are SIFT, SURF, BRISK, AKAZE and DAISY. In our case, the control configurations NORMAL, HIGH and ULTRA for the SIFT and AKAZE methods were used [11].
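A loose OpenCV analogue of this description step is sketched below (OpenMVG's describer presets and AKAZE variants differ in detail; the image path is a placeholder). It contrasts the two descriptor families used in the experiments; AKAZE's default MLDB descriptor is binary, which corresponds to the "AKAZE binary" rows in the tables that follow.

import cv2

img = cv2.imread("view_01.jpg", cv2.IMREAD_GRAYSCALE)     # placeholder path

detectors = [
    ("SIFT", cv2.SIFT_create()),                                          # float descriptors
    ("AKAZE binary", cv2.AKAZE_create()),                                 # default MLDB, binary
    ("AKAZE float", cv2.AKAZE_create(descriptor_type=cv2.AKAZE_DESCRIPTOR_KAZE)),
]
for name, det in detectors:
    kp, des = det.detectAndCompute(img, None)
    print(name, len(kp), "keypoints, descriptor dtype:", des.dtype)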

C. Corresponding Images and Correspondences Computation

We establish the putative photometric matches and filter the resulting correspondences using robust geometric filters. The nearest-neighbor matching methods used were BruteForce L2, Approximate Nearest Neighbor L2 and Cascade Hashing L2 matching. The nearest-neighbor distance ratio was set to 0.8.
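A hedged OpenCV sketch of this putative-matching-plus-filtering idea (OpenMVG applies its own robust geometric filters, e.g. based on AC-RANSAC; this only illustrates the principle): matches passing the 0.8 distance-ratio test are further filtered with a RANSAC-estimated fundamental matrix.

import numpy as np
import cv2

def filter_matches(kp1, des1, kp2, des2, ratio=0.8):
    # Putative matches with a BruteForce L2 matcher and the distance-ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [p[0] for p in matcher.knnMatch(des1, des2, k=2)
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < 8:
        return []
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    # Geometric filter: keep only matches consistent with a fundamental matrix.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    if mask is None:
        return []
    return [m for m, keep in zip(good, mask.ravel()) if keep]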

D. SFM Solving

For SFM solving we used the Global SFM method [7], [11]. This method first applies filters to the point clouds to remove outliers, then starts searching from the initial transformation. In each iteration, the algorithm establishes the correspondences, creates point pairs from the reading and reference clouds, applies the transformation to the reading point cloud, and finds the transformation that minimizes the match error (see Fig. 6 and Fig. 7). The process continues until the result converges or the maximum number of iterations is reached [11].
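The four steps above can also be driven from the OpenMVG command-line tools. The sketch below is a hedged example: the tool names and flags are taken from OpenMVG v1.x and may differ between releases, and all paths are placeholders.

import subprocess

IMG_DIR, MATCH_DIR, RECON_DIR = "images/", "matches/", "reconstruction/"
SENSOR_DB = "sensor_width_camera_database.txt"

steps = [
    # 1) image listing -> sfm_data.json with one View per image
    ["openMVG_main_SfMInit_ImageListing", "-i", IMG_DIR, "-o", MATCH_DIR, "-d", SENSOR_DB],
    # 2) per-view image description (SIFT descriptor, HIGH preset)
    ["openMVG_main_ComputeFeatures", "-i", MATCH_DIR + "sfm_data.json",
     "-o", MATCH_DIR, "-m", "SIFT", "-p", "HIGH"],
    # 3) putative matches with the approximate nearest-neighbor matcher, ratio 0.8
    ["openMVG_main_ComputeMatches", "-i", MATCH_DIR + "sfm_data.json",
     "-o", MATCH_DIR, "-n", "ANNL2", "-r", "0.8"],
    # 4) global SfM solving
    ["openMVG_main_GlobalSfM", "-i", MATCH_DIR + "sfm_data.json",
     "-m", MATCH_DIR, "-o", RECON_DIR],
]
for cmd in steps:
    subprocess.run(cmd, check=True)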

Fig. 6. Point cloud: a) reference point cloud, b) colorized point cloud.


Fig. 7. 3D models: a) point cloud alignment using 4 equivalent point pairs, b) registration of the original point cloud with the created point cloud.

The computation times of the three algorithms are shown in Table I. The computational time for the SIFT algorithm using the Brute-Force matcher (BF) was approximately 80 seconds, and for the AKAZE method 829 seconds, respectively (see Table I). On the other hand, the computational time for the SIFT algorithm using the Approximate Nearest Neighbors matcher (ANN) was approximately 160 seconds, and for the AKAZE method 109 seconds, respectively. The feature matching time for AKAZE using a binary descriptor with the Brute-Force Hamming matcher (BF Hamming) was approximately 194 seconds. The filtering time (point removal) for SIFT using the Brute-Force matcher (BF) was approximately 10 seconds, and for the AKAZE method 11 seconds. On the other hand, the filtering time for the SIFT algorithm using the Approximate Nearest Neighbors matcher (ANN) was approximately 12 seconds, and for the AKAZE method 11 seconds, respectively. The filtering time for AKAZE using a binary descriptor with the Brute-Force Hamming matcher (BF Hamming) was approximately 12 seconds.

TABLE I. THE COMPARISON OF PROCESSING TIME

Method        | Feature matching time / filtering (point removal) time [s]
              | BF      | ANN     | Cascade | Fast Cascade | BF Hamming
SIFT          | 80/10   | 160/12  | 15/13   | 6/13         | -
AKAZE         | 829/11  | 109/11  | 30/22   | 25/22        | -
AKAZE binary  | -       | -       | -       | -            | 194/12

The key idea of the matching algorithms (SIFT, AKAZE, AKAZE binary) is to find matching points between two input images. The matching rate (number of correct matching points) is shown in Table II.

TABLE II. MATCHING RATE (NUMBER OF CORRECT MATCHING POINTS)

Method        | Number of images / number of matching points [-]
              | BF       | ANN      | Cascade  | Fast Cascade | BF Hamming
SIFT          | 41/8732  | 43/8942  | 36/7146  | 37/7219      | -
AKAZE         | 35/7801  | 33/7331  | 28/6690  | 30/7020      | -
AKAZE binary  | -        | -        | -        | -            | 25/5500

The root mean square error (RMSE) is returned as the Euclidean distance between the aligned point clouds. The point-to-point error takes every point in the reading cloud (pi), applies the transformation matrix (rotation matrix R and translation vector t), and finds the closest corresponding point in the reference cloud (qi). Then it calculates the square of the Euclidean distance and sums over all N such pairs. When the models are identical, the error is zero. However, this error by itself does not hold much information about the quality of the registration. Instead, the root mean square error (RMSE) is calculated, which gives the average distance error between the two point clouds (see Table III).
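A hedged numerical sketch of this point-to-point RMSE (the evaluation tool actually used is not specified in the paper): each reading point is transformed with (R, t), its nearest reference point is found with a k-d tree, and the squared distances are averaged over all N pairs.

import numpy as np
from scipy.spatial import cKDTree

def registration_rmse(reading, reference, R, t):
    # reading: Nx3 points p_i, reference: Mx3 points, R: 3x3 rotation, t: translation.
    transformed = reading @ R.T + t
    dists, _ = cKDTree(reference).query(transformed)   # distance to closest q_i for each p_i
    return np.sqrt(np.mean(dists ** 2))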

TABLE III. THE COMPARISON OF RMSE ERROR

Method        | RMSE [-]
              | BF      | ANN     | Cascade | Fast Cascade | BF Hamming
SIFT          | 0.5528  | 0.5554  | 0.5068  | 0.5114       | -
AKAZE         | 0.8164  | 0.7971  | 0.7941  | 0.8109       | -
AKAZE binary  | -       | -       | -       | -            | 0.7071

The alignment of the two clouds was realized using 4 equivalent point pairs (see Fig. 7). The visualization of the Hausdorff distance between the reference point cloud and the created point cloud is shown in Fig. 8.
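The paper does not name the tool used for this 4-point alignment; as an illustration, a rigid transform can be estimated from a handful of equivalent point pairs with the SVD-based Kabsch method:

import numpy as np

def rigid_from_pairs(src, dst):
    # src, dst: Nx3 corresponding points (N >= 3, here the 4 picked pairs).
    # Returns R (3x3) and t (3,) such that dst is approximately src @ R.T + t.
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # correct an improper rotation (reflection)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cd - R @ cs
    return R, t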

Fig. 8. Visualization of the Hausdorff distance (color scale ranging from 0.000486 to 0.500000).

For the comparison, the Hausdorff distance of the two point clouds was used. The blue color corresponds to the minimal Hausdorff distance and the red color to the maximal Hausdorff distance (see Fig. 8).
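A hedged sketch of the Hausdorff comparison itself, computing the symmetric Hausdorff distance between the reference and reconstructed clouds with SciPy:

from scipy.spatial.distance import directed_hausdorff

def hausdorff(cloud_a, cloud_b):
    # cloud_a, cloud_b: Nx3 / Mx3 arrays of 3D points.
    d_ab = directed_hausdorff(cloud_a, cloud_b)[0]
    d_ba = directed_hausdorff(cloud_b, cloud_a)[0]
    return max(d_ab, d_ba)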

V. CONCLUSION AND FUTURE WORK

In this work, we presented an experimental evaluation of the performance of three feature matching algorithms (SIFT, AKAZE, AKAZE binary) using the BF, ANN, Cascade, Fast Cascade and BF Hamming matchers. All our experiments are divided into two parts. Firstly, the algorithm identifies features in all the images of the dataset, such as edges with gradients in multiple directions, matches these features between the images, and then computes the relative motion. Secondly, the algorithm builds a 3D model from the correlated features and creates a 3D point cloud with color information of the scanned object. The final alignment of the two clouds was realized using 4 equivalent point pairs. The experimental results show that the best results were obtained using the combination of the SIFT (Scale-Invariant Feature Transform) descriptor and the ANN (Approximate Nearest Neighbor) matcher (43 matching images and 8942 corresponding points found). The computational time for the SIFT algorithm using the Approximate Nearest Neighbors matcher (ANN) was approximately 160 seconds.

In our next work, we plan to improve all experiments by accelerating the computation and by improving the accuracy of the point cloud registration algorithm. In our case this represents a compromise between accuracy and efficiency: the higher the accuracy requirement, the greater the demands in terms of computation time and final memory usage. Finally, these algorithms based on multiple view geometry will be used for improved sky/cloud detection, recognition and registration.

REFERENCES

[1] L. Moisan and B. Stival, “A Probabilistic Criterion to Detect Rigid Point Matches Between Two Images and Estimate the Fundamental Matrix,” International Journal of Computer Vision, 57(3), 2004, pp. 201-218.

[2] S. Agarwal and B. Bhowmick, “3D point cloud registration with shape constraint,” 2017 IEEE International Conference on Image Processing (ICIP), Beijing, 2017, pp. 2199-2203.

[3] M. Attia and Y. Slama, “Efficient Initial Guess Determination Based on 3D Point Cloud Projection for ICP Algorithms,” 2017 International Conference on High Performance Computing & Simulation (HPCS), Genoa, 2017, pp. 807-814.

[4] I. Nikolov and C. Madsen, “Benchmarking Close-range Structure from Motion 3D Reconstruction Software Under Varying Capturing Conditions,” Springer International Publishing, Cham, 2016, pp. 15-26.

[5] Djurdjani and D. Laksono, “Open source stack for Structure from Motion 3D reconstruction: A geometric overview,” 2016 6th International Annual Engineering Seminar (InAES), Yogyakarta, 2016, pp. 196-201.

[6] M. Nabil and F. Saleh, “3D reconstruction from images for museum artefacts: A comparative study,” 2014 International Conference on Virtual Systems & Multimedia (VSMM), Hong Kong, 2014, pp. 257-260.

[7] P. Moulon, P. Monasse and R. Marlet, “Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion,” IEEE International Conference on Computer Vision, Sydney, NSW, 2013, pp. 3248-3255.

[8] A. Cristea: Free Blender Skin Shader and Model, http://www.3dscanstore.com/index.php?route=information/information&information_id=16

[9] Structure from Motion (Structure from Motion from Two Views), http://ww2.mathworks.cn/help/vision/ug/structure-from-motion.html?nocookie=true&.mathworks.com

[10] Structure From Motion from Multiple Views, http://uk.mathworks.com/help/vision/examples/structure-from-motion-from-multiple-views.html?.mathworks.com&nocookie=true

[11] P. Moulon, P. Monasse, R. Perrot, and R. Marlet, “OpenMVG: Open Multiple View Geometry,” Lecture Notes in Computer Science, Volume 10214 LNCS, Cancun, Mexico, 2017, pp. 60-74.

[12] A. Khatamian, H. R. Arabnia, “Survey on 3D Surface Reconstruction,” Journal of Information Processing Systems, 2016, Vol. 12, No. 3, pp. 338-357, September 2016.