
High Resolution Point Cloud Generation from Kinect and HD Cameras Using Graph Cut

Suvam Patra1, Brojeshwar Bhowmick1, Subhashis Banerjee1 and Prem Kalra1

1Department of Computer Science and Engineering, Indian Institute of Technology Delhi, New Delhi, India
{suvam.patra.mcs10, brojeshwar, suban, pkalra}@cse.iitd.ac.in

Keywords: Kinect, Resolution Enhancement, Graph Cut, Normalized Cross Correlation, Photo Consistency, VGA, HD.

Abstract: This paper describes a methodology for obtaining a high resolution dense point cloud using Kinect (J. Smisek and Pajdla, 2011) and HD cameras. Kinect produces a VGA resolution photograph and a noisy point cloud. But high resolution images of the same scene can easily be obtained using additional HD cameras. We combine the information to generate a high resolution dense point cloud. First, we do a joint calibration of Kinect and the HD cameras using traditional epipolar geometry (R. Hartley, 2004). Then we use the sparse point cloud obtained from Kinect and the high resolution information from the HD cameras to produce a dense point cloud in a registered frame using graph cut optimization. Experimental results show that this approach can significantly enhance the resolution of the Kinect point cloud.

1 INTRODUCTION

Nowadays, many applications in computer vision are centred around generation of a complete 3D model of an object or a scene from depth scans or images. This traditionally required capturing images of the scene from multiple views to generate a model of the scene. However, today, with the advent of affordable range scanners, reconstruction of scenes from multi-modal data, which includes images as well as depth scans of objects and scenes, helps in more accurate modelling of 3D scenes.

There has been considerable work with time-of-flight (ToF) cameras, which capture depth scans of the scene by measuring the travel time of an emitted IR wave from the device reflected back from the object (S. Schuon and Thrun, 2008). Recently, a much cheaper range sensor has been introduced by Microsoft, called the Kinect (J. Smisek and Pajdla, 2011), which has an inbuilt camera, an IR emitter and a receiver. The emitter projects a predetermined pattern whose reflection off the object provides the depth cues for 3D reconstruction. Though Kinect produces range data only in VGA resolution, this data can be very useful as an initial estimate for subsequent resolution enhancement. There have been several approaches to enhance the resolution of a point cloud obtained from range scanners or ToF cameras, using interpolation or graph based techniques (S. Schuon and Thrun, 2009; S. Schuon and Thrun, 2008). Diebel

et al. (Diebel and Thrun, 2006) used an MRF based approach whose basic assumption is that depth discontinuities in a scene often co-occur with intensity or brightness changes, or in other words, regions of similar intensity in a neighbourhood have similar depth. Yang et al. (Qingxiong Yang and Nistér, 2007) make the same assumption and use a bilateral filter to enhance the resolution in depth. However, the assumption is not universally true and may result in over-smoothing of the solution.

Schuon et al. (S. Schuon and Thrun, 2009; S. Schuon and Thrun, 2008) use a super-resolution algorithm on low resolution LIDAR ToF cameras and rely on the depth data for detecting depth discontinuities instead of relying on regions of image smoothness.

In this paper we propose an algorithm for depth super-resolution using additional information from multiple images obtained through HD cameras. We register the VGA resolution point cloud obtained from Kinect with what can be obtained from the HD cameras using multiple view geometry and carry out a dense 3D reconstruction in the registered frame using two basic criteria: i) photo-consistency (K Kutulakos, 1999) and ii) rough agreement with Kinect. The reconstructed point cloud is at least ten times denser than the initial point cloud. In this process we also fill up the holes of the initial Kinect point cloud.


2 PROPOSED METHODOLOGY

2.1 Camera Calibration

We determine the camera internal calibration matrices (R. Hartley, 2004) for the Kinect VGA camera and all the HD cameras offline using a state of the art camera calibration technique (Z. Zhang, 2000). Henceforth we assume that all the internal camera calibration matrices are known and define the 3×4 camera projection matrix for the Kinect VGA camera as

P = K[I|0] (1)

where K is the camera internal calibration matrix of the Kinect VGA camera. In other words, Kinect is our world origin.
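For reference, the following minimal sketch shows how the internal calibration could be obtained with OpenCV's implementation of Zhang's method; the checkerboard size and image paths are illustrative assumptions and not part of the original set-up.

```python
# Minimal sketch of offline internal calibration (Zhang, 2000) with OpenCV.
# The checkerboard size and image paths are illustrative assumptions.
import glob
import cv2
import numpy as np

board = (9, 6)                                   # inner corners per row/column (assumed)
objp = np.zeros((board[0] * board[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2)

obj_pts, img_pts = [], []
for path in glob.glob("calib/*.png"):            # hypothetical calibration images
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, board)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

# K is the 3x3 internal calibration matrix used in P = K[I|0].
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
P_kinect = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # equation (1)
```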

We use ASIFT (Jean-Michel Morel, 2009) to obtain image point correspondences and for every HD camera we compute the extrinsic camera parameters using standard epipolar geometry (R. Hartley, 2004). For each HD camera we first carry out a robust estimation of the fundamental matrix (R. Hartley, 2004). Given a set of image point correspondences x and x′, the fundamental matrix F is given by:

x′ᵀFx = 0    (2)

and can be computed from eight or more point correspondences.
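A minimal sketch of this robust estimation, assuming `pts_kinect` and `pts_hd` hold the ASIFT correspondences as N×2 arrays; OpenCV's RANSAC-based estimator is used here in place of a plain eight-point solver.

```python
# Robust fundamental matrix estimation from ASIFT correspondences (sketch).
# pts_kinect, pts_hd: N x 2 float arrays of matched pixel coordinates (assumed given).
import cv2
import numpy as np

F, inlier_mask = cv2.findFundamentalMat(
    pts_kinect, pts_hd, cv2.FM_RANSAC,
    ransacReprojThreshold=1.0, confidence=0.99)

# Keep only inlier correspondences; they satisfy x'^T F x = 0 (equation 2).
inliers = inlier_mask.ravel().astype(bool)
pts_kinect, pts_hd = pts_kinect[inliers], pts_hd[inliers]
```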

Once the fundamental matrix is known, we can estimate the external calibration from the essential matrix E, derived from the fundamental matrix using the equation as in (R. Hartley, 2004)

E = K′ᵀFK = [t]×R = R[Rᵀt]×

where K′ is the internal calibration matrix of the HD camera. As this essential matrix has four possible decompositions, we select one of them using the cheirality check (R. Hartley, 2004) on the Kinect point cloud.

The projection matrix of the HD camera in the Kinect reference frame is then given as

P′ = K′[R|t] (3)
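The following sketch illustrates how equation (3) could be assembled, assuming the fundamental matrix F, the intrinsics K and K_hd, and the Kinect point cloud X_kinect are available; the cheirality test here simply counts points with positive depth in both cameras.

```python
# Essential matrix and external calibration of an HD camera (sketch).
# F, K, K_hd and X_kinect (N x 3 Kinect points in the Kinect frame) are assumed given.
import cv2
import numpy as np

E = K_hd.T @ F @ K

# E admits four (R, t) decompositions; keep the one that passes the cheirality
# check, i.e. the one placing the Kinect points in front of both cameras.
# Note: t recovered from E is only defined up to scale.
R1, R2, t = cv2.decomposeEssentialMat(E)
candidates = [(R1, t), (R1, -t), (R2, t), (R2, -t)]

def num_in_front(R, t, X):
    Xc = (R @ X.T + t).T            # Kinect-frame points expressed in the HD camera
    return np.count_nonzero((X[:, 2] > 0) & (Xc[:, 2] > 0))

R, t = max(candidates, key=lambda c: num_in_front(c[0], c[1], X_kinect))
P_hd = K_hd @ np.hstack([R, t])     # equation (3): P' = K'[R|t]
```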

2.2 Generation of High Resolution Point Cloud

The normalized cross correlation (NCC) method, which tries to find point correspondences in an image pair by computing the statistical correlation between windows centred at the candidate points, is an inadequate tool for finding dense point correspondences. Projecting the sparse Kinect point cloud on to an HD image leaves most pixels without depth labels, and one can attempt to establish correspondences for these pixels using normalized cross correlation along rectified epipolar lines. Once a correspondence is found we can obtain the 3D point for this correspondent pair using a stereo triangulation technique. In figure 1 we show a result obtained using NCC. The reconstruction has many holes due to ambiguous cross correlation results and incorrect depth labels.
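A minimal sketch of such a baseline NCC search and triangulation is given below, assuming `img1` and `img2` are rectified grayscale images with projection matrices `P1` and `P2`; the window size and disparity range are illustrative.

```python
# Sketch of the baseline NCC search and triangulation behind figure 1.
# img1, img2: rectified grayscale images (epipolar lines are horizontal rows);
# P1, P2: their 3x4 projection matrices after rectification (assumed given).
import cv2
import numpy as np

def ncc(a, b):
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def match_and_triangulate(img1, img2, P1, P2, x, y, win=5, max_disp=128):
    h = win // 2
    ref = img1[y - h:y + h + 1, x - h:x + h + 1].astype(np.float32)
    best, best_x2 = -1.0, None
    for d in range(max_disp):                      # search along the same row
        x2 = x - d
        if x2 - h < 0:
            break
        cand = img2[y - h:y + h + 1, x2 - h:x2 + h + 1].astype(np.float32)
        score = ncc(ref, cand)
        if score > best:
            best, best_x2 = score, x2
    # Triangulate the best correspondence into a 3D point (homogeneous output).
    X = cv2.triangulatePoints(P1, P2,
                              np.float32([[x], [y]]),
                              np.float32([[best_x2], [y]]))
    return (X[:3] / X[3]).ravel(), best
```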

Figure 1: Resolution enhancement using NCC. (a) Initial Kinect point cloud; (b) high resolution point cloud generated by NCC.

The voxel labelling problem can be represented as one of minimizing an energy function of the form

E(L) = ∑_{p∈P} Dp(Lp) + ∑_{(p,q)∈N} Vp,q(Lp, Lq)    (4)

where P is the set of voxels to be labelled, L = {Lp | p ∈ P} is a 0-1 labelling of the voxels, Dp(.) is a data term measuring the consistency of the label assignment with the available data, N defines a neighbourhood system for the voxel space and each Vp,q(.) is a smoothness term that measures the consistency of labelling at neighbouring voxels.

When the above energy minimization problem is represented in graphical form (Boykov and Kolmogorov, 2004), we get a two terminal graph with one source and one sink node representing the two possible labels for each voxel (see figure 2). Each voxel is represented as a node in the graph and each node is connected to both the source and sink nodes with edge weights defined according to the data term of the energy function. In addition, the voxel nodes are also connected to each other with edges, with edge strengths defined according to the neighbourhood interaction term. A minimum cut through this graph gives us a minimum energy configuration.
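A minimal sketch of this construction using the PyMaxflow wrapper of the Boykov-Kolmogorov max-flow implementation is shown below; the cost arrays are assumed to be precomputed over the voxel grid.

```python
# Two-terminal graph cut over a voxel grid (sketch using the PyMaxflow library).
# src_cost[z, y, x] / snk_cost[z, y, x] hold the data-term weights towards the
# source and sink; smooth_cost is the constant Potts penalty between 6-neighbours.
import numpy as np
import maxflow

def label_voxels(src_cost, snk_cost, smooth_cost=100.0):
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(src_cost.shape)      # one node per voxel
    g.add_grid_edges(nodes, smooth_cost)          # n-links: default 6-neighbourhood in 3D
    g.add_grid_tedges(nodes, src_cost, snk_cost)  # t-links to source and sink
    g.maxflow()                                   # minimum cut = minimum energy
    return g.get_grid_segments(nodes)             # True where the voxel falls on the sink side
```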

Figure 2: A two terminal graph, from (Boykov and Kolmogorov, 2004).


Assigning Cost to the Data Term

Photo consistency (K Kutulakos, 1999) is one of the most frequently used measures for inter-image consistency. However, in real situations, several voxels in a close neighbourhood in depth satisfy the photo consistency constraint, resulting in a “thick” surface as demonstrated in the top view in figure 3. In view of this, we use closeness to the initial Kinect data as an additional measure to resolve this problem of thickness in the output high resolution point cloud.

Figure 3: Comparison between resolution enhancement without and with the distance measure. (a) Actual view of the scene from the front; (b) top view without the distance measure; (c) top view with the distance measure.

We define the data term based on the following two criteria: i) an adaptive photo consistency measure for each voxel, and ii) the distance of each voxel from its nearest approximate surface.

We use the photo consistency measure suggested by Slabaugh et al. (Slabaugh and Schafer, 2003). We project each voxel i on to the N HD images and calculate the following two measures:

1. S(i), the standard deviation of the intensity values in the projection neighbourhoods calculated over all N images.

2. s̄(i), the average of the standard deviations in the projection neighbourhoods for each image projection.

The voxel i is photo consistent over the N images if the following condition is satisfied

S(i) < τ1 + τ2 ∗ s̄(i)    (5)

where τ1 and τ2 are global and local thresholds to be suitably defined depending on the scene. The overall threshold specified by the right hand side of the above inequality changes adaptively for each voxel. For each voxel we assign a weight Dphoto(.) for the terminal edges in the graph based on this threshold.

Dphoto(i) = photocost ∗ exp(−S(i) / (τ1 + τ2 ∗ s̄(i)))    (6)

with the source, and

Dphoto(i) = photocost ∗ (1 − exp(−S(i) / (τ1 + τ2 ∗ s̄(i))))    (7)

with the sink, where S(i) and τ1 + τ2 ∗ s̄(i) are the standard deviation and the adaptive threshold, respectively, for the i-th voxel, and photocost is a scale factor. The expression inside the exponential gives the normalized standard deviation of the i-th voxel.
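A sketch of equations (5)-(7) for a single voxel is given below, assuming `patches` holds the intensities of the voxel's projection neighbourhood in each of the N HD images; the helper that gathers the patches and the value of `photocost` are illustrative assumptions.

```python
# Adaptive photo-consistency cost for one voxel (equations 5-7), sketch.
# patches: list of N 1-D arrays with the intensities of the voxel's projection
# neighbourhood in each HD image (how they are gathered is omitted here).
import numpy as np

def photo_costs(patches, tau1, tau2, photocost=1000.0):
    all_vals = np.concatenate(patches)
    S = all_vals.std()                                   # S(i): std over all N images
    s_bar = np.mean([p.std() for p in patches])          # mean per-image std
    threshold = tau1 + tau2 * s_bar                       # adaptive threshold, eq. (5)
    consistent = S < threshold
    d_source = photocost * np.exp(-S / threshold)         # weight to source, eq. (6)
    d_sink = photocost * (1.0 - np.exp(-S / threshold))   # weight to sink,   eq. (7)
    return consistent, d_source, d_sink
```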

As a pre-processing step before applying graph cut, we create an approximate surface (M Alexa, 2003) for each non-Kinect voxel using the Kinect voxels in its neighbourhood NK of size K×K×K. We pre-process the Kinect point cloud to generate an approximate surface for each non-Kinect voxel in our voxel space in the following way:

We consider Sp as the surface that can be constructed with the voxels P = {pi} captured by the Kinect. Then, as suggested in (M Alexa, 2003), we try to replace Sp with an approximate surface Sr with a reduced set of voxels R = {ri}. This is done in two steps. A local reference plane H = {x | 〈n, x〉 − D = 0, x ∈ R³}, n ∈ R³, ||n|| = 1 is constructed using a moving least squares fit on the point pi under consideration. The weight for each pi is a function of its distance from the current voxel's projection on to the plane. So, H can be determined by locally minimizing

∑_{i=1}^{N} (〈n, pi〉 − D)² θ(||pi − q||)    (8)

where θ is a smooth monotonically decreasing function, q is the projected point on the plane corresponding to the voxel r, n is the normal and D is the perpendicular distance of the plane from the origin. Assuming q = r + tn with t as a scale parameter along the normal, equation (8) can be rewritten as

∑_{i=1}^{N} 〈n, pi − r − tn〉² θ(||pi − r − tn||)    (9)

Let qi be the projection of pi on H and fi be the height of pi over H. We can find the surface estimate Z = g(X, Y) by minimizing the least squares equation given by:

∑_{i=1}^{N} (g(xi, yi) − fi)² θ(||pi − q||)    (10)

where xi and yi are the x and y values corresponding to the i-th voxel and θ is a smooth monotonically decreasing function which is defined as:

θ(d) = e^(−d²/h²)    (11)

where h is a fixed parameter which represents the spacing between neighbouring voxels; it controls the smoothness of the surface. For our experiments we have used a fourth order polynomial fit.
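The following sketch illustrates one way such an MLS projection could be computed; the weighted-PCA reference plane below is a simplification of the full minimization in equations (8)-(9), and the function and variable names are assumptions for illustration.

```python
# Sketch of the MLS surface estimate used to get an approximate depth for a
# non-Kinect voxel (equations 8-11).  The weighted-PCA plane is a common
# simplification of the full MLS reference-plane minimization.
import numpy as np

def theta(d, h):
    return np.exp(-(d ** 2) / (h ** 2))                  # equation (11)

def poly_basis(u, v, order):
    # all monomials u^a * v^b with a + b <= order
    return np.column_stack([u ** a * v ** b
                            for a in range(order + 1)
                            for b in range(order + 1 - a)])

def mls_project(r, kinect_pts, h, order=4):
    """Return the point on the local MLS surface closest to voxel r.

    kinect_pts: M x 3 Kinect neighbours of r (needs at least as many points as
    polynomial basis terms, i.e. 15 for order 4)."""
    w = theta(np.linalg.norm(kinect_pts - r, axis=1), h)
    centroid = (w[:, None] * kinect_pts).sum(0) / w.sum()
    # Weighted covariance; its smallest eigenvector approximates the plane normal n.
    d = kinect_pts - centroid
    cov = (w[:, None, None] * d[:, :, None] * d[:, None, :]).sum(0)
    _, eigvec = np.linalg.eigh(cov)
    n, v_axis, u_axis = eigvec[:, 0], eigvec[:, 1], eigvec[:, 2]
    # Local frame (u, v) in the plane, f = height over the plane (equation 10).
    u, v, f = d @ u_axis, d @ v_axis, d @ n
    A = poly_basis(u, v, order) * np.sqrt(w)[:, None]
    coeffs, *_ = np.linalg.lstsq(A, f * np.sqrt(w), rcond=None)
    # Evaluate the polynomial at the projection of r onto the plane.
    rq = r - centroid
    ur, vr = rq @ u_axis, rq @ v_axis
    height = poly_basis(np.array([ur]), np.array([vr]), order) @ coeffs
    return centroid + ur * u_axis + vr * v_axis + height[0] * n
```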


This surface is locally smooth and usually lacks geometric details, but provides a good measure of the approximate depth of the surface.

Hence, the second cost that we include in the data term is based on the distance of the current voxel from the pre-computed surface that fits that voxel. So, we project each non-Kinect voxel on to the pre-computed surface (M Alexa, 2003). Ideally, if the voxel is on the surface then the difference between its actual coordinates and projected coordinates should be small, which motivates us to use this measure in the data term. Accordingly, we assign a cost to Dp on the basis of the Euclidean distance between the voxel's actual coordinates and its projected coordinates on the approximate surface.

Ddist(i) = ||P(ri) − ri|| / dist threshold    (12)

with the source, and

D′dist(i) = 1 − Ddist(i)    (13)

with the sink, where P(ri) denotes the projection of the voxel ri on to the approximate surface. Here, the threshold dist threshold is experimentally determined on the basis of the scene under consideration. The total cost is expressed as:

Dp(i) = Ddist(i) ∗ Dphoto(i)    (14)

Table 1: Assignment of Dp.

Dp(i)                                           | Type of voxel
∞ with source and 0 with sink                   | Kinect voxel
Based on equations (6), (7), (12), (13), (14)   | Non-Kinect voxel

The cost Dp(.) is assigned to a Kinect voxel so that it is turned “ALWAYS ON”. After that, for each non-Kinect voxel, first a distance check is done, followed by a photo consistency check over all the N HD images. Then accordingly a cumulative cost is assigned based on the equations above.
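A sketch of this assignment for a single voxel is given below, reusing the `mls_project` and `photo_costs` helpers sketched earlier; `INF` stands in for the infinite source weight of Table 1.

```python
# Assigning the terminal (t-link) weights of the data term (Table 1), sketch.
import numpy as np

INF = 1e9   # stands in for the infinite source weight of Kinect voxels

def terminal_weights(voxel, is_kinect, kinect_pts, patches, tau1, tau2,
                     h, dist_threshold):
    # Kinect voxels are forced "ALWAYS ON": infinite weight to source, 0 to sink.
    if is_kinect:
        return INF, 0.0
    # Distance cost, equations (12)-(13): distance of the voxel from the MLS surface.
    proj = mls_project(voxel, kinect_pts, h)              # sketched earlier
    d_dist_src = np.linalg.norm(proj - voxel) / dist_threshold
    d_dist_snk = 1.0 - d_dist_src
    # Photo-consistency cost, equations (6)-(7), from the earlier sketch.
    _, d_photo_src, d_photo_snk = photo_costs(patches, tau1, tau2)
    # Cumulative data term, equation (14).
    return d_dist_src * d_photo_src, d_dist_snk * d_photo_snk
```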

Assigning Cost to the Smoothness Term

We have assigned a constant smoothness cost to the edges between each voxel and its neighbourhood N. Here, we have taken N to be the 6-neighbourhood of each voxel.

The smoothness cost is assigned according to the Potts model (V. Kolmogorov, 2004; Y. Boykov and Zabih, 2001). We can represent Vp,q as

Vp,q(fp, fq) = Up,q · δ(fp ≠ fq)    (15)

Here, we have taken Vp,q from the Potts model as in Table 2. After assigning the data and smoothness costs to the graph edges, we run the min-cut on this graph.

Table 2: Assignment of Vp,q based on the Potts model.

Vp,q(fp, fq) | Condition
0            | fp = fq (both are Kinect voxels)
100          | Otherwise

3 RESULTS

We provide experimental results on both indoor and outdoor scenes. For capturing the HD images we have used the SONY HVR-Z5P camera, which has an image resolution of 1440×1080. This camera was placed at multiple positions to capture images of the same scene from different viewpoints. The experimental set-up for capturing a typical scene by one Kinect and three HD cameras has been depicted in figure 4.

Figure 4: Our experimental set-up for capturing a typical scene.

We have used a Dell Alienware i7 machine with 6 GB RAM for producing the results. In our case the number of voxels that we take for the scene depends largely on the amount of physical memory of the machine. Figure 5 shows the resolution enhancement of an indoor scene done using one Kinect and two HD cameras. Figure 5b shows the high resolution point cloud generated with our method. In this, all the holes have been filled up, in contrast to the point cloud generated using the NCC based method shown in figure 1b. There are almost no outlier points. Here we have used 300×300×100 voxels and the values τ1 = 60 and τ2 = 0.5. Figure 6 shows the result of resolution enhancement on an outdoor scene in the archaeological site of Hampi using one Kinect and two HD cameras. The point cloud is at least 10 times denser than the initial point cloud. The values of τ1 and τ2 were chosen to be 80 and 0.5 respectively. Figure 7 also shows the resolution enhancement on another sculpture at Hampi using one Kinect and two HD cameras. The values of τ1 and τ2 were similar to figure 6.

Figure 8 shows the resolution enhancement of a toy model where the surface is not smooth. This experiment was performed using one Kinect and three HD cameras. We have shown the dense point cloud corresponding to both the low resolution scene as well


as the high resolution scene and finally overlapped their coloured depth maps to show that the geometry is not distorted in any way. In order to do a quantitative evaluation of our method we have adopted two approaches.

Figure 5: Indoor scene - a typical room. (a) Initial low resolution point cloud from Kinect; (b) and (c) front and side view of the high resolution point cloud generated by our method with τ1 = 80 and τ2 = 0.5.

Figure 6: Archaeological scene 1 - a sculpture depicting a monkey on a pillar. (a) Initial low resolution point cloud from Kinect; (b) and (c) front and side view of the high resolution point cloud generated by our method with τ1 = 60 and τ2 = 0.5.

Figure 7: Archaeological scene 2 - a sculpture depicting a goddess on a pillar. (a) Initial low resolution point cloud from Kinect; (b) and (c) front and side view of the high resolution point cloud generated by our method with τ1 = 60 and τ2 = 0.5.

3.1 Verification through Projection on Another Camera

In order to demonstrate the efficiency of our method we have computed the projection matrix of a different

Figure 8: Indoor scene - a model of a dog. (a) Initial low resolution point cloud from Kinect; (b) front view of the high resolution point cloud generated by our method with τ1 = 70 and τ2 = 0.5; (c) blue HD depth map overlapped with red low resolution depth map, showing that the geometry is preserved.

Figure 9: Verification through projection on another camera for the scene in figure 6. (a) Original image; (b) projected image; (c) difference image. The difference image, around 90% of which is black, shows that the geometry is preserved.

camera which sees the same scene as in figure 6, slightly displaced from the original cameras used for resolution enhancement, and whose external calibration matrix [R|t] is known beforehand. We have used this projection matrix to project the HD point cloud onto a 2D image and have taken the difference between the projected image and the ground truth. The difference image in figure 9c is around 90% black, showing that the HD point cloud generated by our method is geometrically accurate.
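A minimal sketch of this verification step is given below, assuming `points` and `colors` hold the generated HD point cloud and `K_v`, `R_v`, `t_v` the calibration of the extra camera; the 10-intensity tolerance for counting a pixel as black is an illustrative choice.

```python
# Verification by projecting the HD point cloud into a held-out camera (sketch).
# points: N x 3 cloud in the Kinect frame; colors: N x 3 uint8 colours;
# K_v, R_v, t_v: calibration of the extra camera; ground_truth: its captured image.
import cv2
import numpy as np

def project_and_diff(points, colors, K_v, R_v, t_v, ground_truth):
    P_v = K_v @ np.hstack([R_v, t_v.reshape(3, 1)])
    X_h = np.hstack([points, np.ones((len(points), 1))])     # homogeneous coordinates
    x = (P_v @ X_h.T).T
    x = (x[:, :2] / x[:, 2:3]).round().astype(int)            # pixel coordinates
    h, w = ground_truth.shape[:2]
    rendered = np.zeros_like(ground_truth)
    valid = (x[:, 0] >= 0) & (x[:, 0] < w) & (x[:, 1] >= 0) & (x[:, 1] < h)
    rendered[x[valid, 1], x[valid, 0]] = colors[valid]
    diff = cv2.absdiff(ground_truth, rendered)
    black_fraction = np.mean(np.all(diff < 10, axis=-1))      # share of ~black pixels
    return rendered, diff, black_fraction
```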

3.2 Verification through Interpolation and Comparison

In order to show that the depth map of the HD point cloud generated by our method conforms to the point cloud generated by Kinect, we generated an interpolated point cloud for the initial point cloud of figure 6 by fitting an MLS surface of order four through it. In order to quantify that our result shows better depth variation than the interpolated point cloud, we took a part of each of the point clouds generated by the interpolation method and our method, and compared them with that of the Kinect point cloud. The standard deviation of the depth variations in the selected


part of the point cloud generated by interpolation was 0.010068, whereas that of our method was 0.021989, which is much closer to the standard deviation of the original point cloud, i.e. 0.024674.

Figure 10: Verification through interpolation and comparison. (a) Original Kinect point cloud; (b) interpolated point cloud; (c) HD point cloud generated by our method. The area marked by the red rectangle shows the part selected for quantitative estimation of the depth variations.

4 CONCLUSIONS

We have presented a methodology which combines HD resolution images with the low resolution Kinect to produce a high-resolution dense point cloud using graph cut. Firstly, the Kinect and HD cameras are registered to transfer the Kinect point cloud to the HD camera frame for obtaining a high resolution point cloud space. Then, we discretize the point cloud into a voxel space and set up a graph cut formulation which takes care of the neighbourhood smoothness factor. This methodology produces a good high resolution point cloud with the help of the low resolution Kinect point cloud, which could be useful in building high resolution models using Kinect.

ACKNOWLEDGMENTS

The authors gratefully acknowledge Dr. Subodh Kumar, Neeraj Kulkarni, Kinshuk Sarabhai and Shruti Agarwal for their constant help in providing several tools for Kinect data acquisition, modules and error notification respectively.

The authors also acknowledge the Department of Science and Technology, India for sponsoring the project on “Acquisition, representation, processing and display of digital heritage sites” with number “RP02362” under the India Digital Heritage programme, which helped us in acquiring the images at Hampi in Karnataka, India.

REFERENCES

Boykov, Y. and Kolmogorov, V. (2004). An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pages 1124–1137.

Diebel, J. and Thrun, S. (2006). An application of Markov random fields to range sensing. In Advances in Neural Information Processing Systems, pages 291–298.

Smisek, J., Jancosek, M. and Pajdla, T. (2011). 3D with Kinect. In IEEE Workshop on Consumer Depth Cameras for Computer Vision.

Morel, J.-M. and Yu, G. (2009). ASIFT: A new framework for fully affine invariant image comparison. SIAM Journal on Imaging Sciences, vol. 2, issue 2.

Kutulakos, K. and Seitz, S. (1999). A theory of shape by space carving. In 7th IEEE International Conference on Computer Vision, volume I, pages 307–314.

Alexa, M., Behr, J., et al. (2003). Computing and rendering point set surfaces. IEEE Transactions on Visualization and Computer Graphics.

Yang, Q., Yang, R., Davis, J. and Nistér, D. (2007). Spatial-depth super resolution for range images. In IEEE Conference on Computer Vision and Pattern Recognition.

Hartley, R. and Zisserman, A. (2004). Multiple View Geometry in Computer Vision. Cambridge University Press, New York, 2nd edition.

Schuon, S., Theobalt, C., Davis, J. and Thrun, S. (2008). High-quality scanning using time-of-flight depth superresolution. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

Schuon, S., Theobalt, C., Davis, J. and Thrun, S. (2009). LidarBoost: Depth superresolution for ToF 3D shape scanning. In IEEE Conference on Computer Vision and Pattern Recognition.

Slabaugh, G. and Schafer, R. (2003). Methods for volumetric reconstruction of visual scenes. International Journal of Computer Vision.

Kolmogorov, V. and Zabih, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 147–159.

Boykov, Y., Veksler, O. and Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pages 1222–1239.

Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11.