IEEE 2013 Visual Communications and Image Processing (VCIP), Kuching, Malaysia

RECOVERING DEPTH OF BACKGROUND AND FOREGROUND FROM A MONOCULAR VIDEO WITH CAMERA MOTION

Hu Tian1, Bojin Zhuang1,2,∗, Yan Hua1, Yanyun Zhao1,2, Anni Cai1,2

1 School of Information and Communication Engineering
2 Beijing Key Laboratory of Network System and Network Culture

Beijing University of Posts and Telecommunications, Beijing, China

ABSTRACT

In this paper we propose a depth recovery approach for monocular videos with or without camera motion. By combining geometric information and moving object extraction, not only the depth of the background but also the depth of the foreground can be recovered. Furthermore, for cases involving complex camera motion, such as fast motion, translation, or vertical movement, we propose a novel global motion estimation (GME) method with effective outlier rejection to extract moving objects, and experiments demonstrate that the proposed GME method outperforms most state-of-the-art methods. The proposed depth recovery approach is tested on four video sequences with different camera movements. Experimental results show that our approach produces more accurate depth for both background and foreground than existing depth recovery methods.

Index Terms— geometric perspective, GME, outlier rejection, moving objects, depth recovery

1. INTRODUCTION

Recovering 3D depth information from existing 2D videos is an important issue in computer vision. Many works in this field have focused on semi-automatic depth generation for 2D-to-3D conversion [1] and on multi-view depth generation [2]. Recovering depth automatically from a monocular video is still a challenging problem due to the lack of user input or multi-view sequences. However, if this problem can be solved, it will promote the development of many applications, such as 2D-to-3D conversion, scene understanding, and robotics.

The existing methods of depth recovery from monocular videos can be divided into two categories: methods for a static camera and methods for a moving camera. The first category has been widely studied by many researchers.

∗ Corresponding author. E-mail address: [email protected]. This work was supported by the National High Technology Research and Development Program of China (863 Program) (No. 2012AA012504), the National S&T Major Project of the Ministry of S&T of China (2012BAH41F03), and the Fundamental Research Funds for the Central Universities.

In this kind of method, various depth cues, such as geometric perspective [3, 4], occlusion [5], and focus/defocus [6], are often utilized to recover the depth of the scene. Cheng [4] uses a depth hypothesis based on geometrical information to assign the depth of each segmented image block. Battiato [3] combines image classification and geometric information to generate the depth map. Tam [6] extracts depth from focus by using blur analysis. When the camera is static, motion information can easily be used to obtain the relative positions of moving objects [7].

When the camera is in motion, recovering depth is difficult, and few studies have addressed this problem. Zhang [8] uses a method called structure from motion (SFM) to estimate the camera parameters, and applies a bundle optimization method to recover depth from videos. Kim [9] uses motion analysis to recognize whether the camera is static. When the camera is moving, the motion is estimated by a KLT feature tracker and then converted to depth. However, these methods only focus on background depth recovery, so moving objects in the scene significantly degrade their performance.

Recently, a new type of depth recovery method based on machine learning has appeared. Saxena [10] employs a training set consisting of a group of images and their corresponding depth maps to train a Markov Random Field (MRF) model, and then uses the model to estimate the depth of a test image. This method is presented only for still images. Karsch [11] gives a depth recovery method for videos, which also employs a dataset of images and their corresponding depth maps. For each frame of the video, the top 7 matched images from the reference dataset are selected as candidate images, and an energy function is optimized to generate the most likely depth map for the frame by considering all of the warped candidates. A simple moving object detection method is added to enhance the depth, which can be used in non-translating movies. The two methods achieve good results on their test sets. However, when the test image is distinctly different from the reference images, these methods produce unsatisfactory results.

As we have seen above, most of these methods only recover the depth of the background. When the camera is static or its movement is simple, methods like [11] can estimate the depth of moving objects.

Fig. 1. Proposed framework.

However, a monocular video contains not only a background scene but also a number of moving objects, and the video may be captured by a camera with complex movements such as fast motion, translation, or vertical movement. In such circumstances, depth recovery of the background and the moving objects is a challenging problem. In this paper we propose an approach to attack this problem.

To recover the depth of moving objects from a video captured by a moving camera, the primary difficulty is extracting the moving objects. In order to do that, one has to know how the background moves, which is estimated by so-called global motion estimation (GME) methods. GME methods can generally be divided into two classes: those based on minimizing the residual errors of intensity [12] and those based on minimizing the residual errors of motion vectors [13]. The latter class needs to compute a motion vector for each pixel block, so it is time consuming, and the accuracy of the motion vectors greatly affects the precision of the estimated camera parameters. The GME method we propose belongs to the first class.

It should be noted that some pixels experience not only global motion (GM) but also local motion (LM), which is usually induced by object motion. Pixels with LM are considered as outliers in GME, as they bias the estimation of the GM parameters. Chen [13] proposes to reject outliers by applying three filters before GME, while Hartley [14] uses Random Sample Consensus (RANSAC). Qi [12] proposes to select candidate outlier blocks with large residuals, and to reject outlier blocks from the candidates according to their positions and relations with their neighborhoods. However, these methods cannot eliminate all outliers.

In this paper we propose a framework to recover the depth of both background and foreground from a monocular video with or without camera motion. In particular, when the camera is in motion, we propose a novel GME method with effective outlier rejection to extract the moving objects, which is the most important phase of the proposed framework.

2. PROPOSED FRAMEWORK

A schematic description of the proposed framework is shown in Fig. 1. It consists of four stages: background depth generation (BDG), motion analysis, moving object extraction, and depth fusion.

Fig. 2. Example of the proposed framework: input image sequence, background depth from geometric information, moving objects extracted by GME, and final depth after fusion.

The depth of the background is generated in BDG by combining geometrical information, color segmentation, and sky detection. In the motion analysis stage, the sum of absolute differences (SAD) over four corner blocks between two consecutive frames is calculated to judge whether the camera is in motion. If camera movement exists, the proposed GME is used to extract moving objects; otherwise, motion segmentation (MS) is used. In the depth fusion stage, the depth of a moving object is assigned the depth of the background point that the bottom of the object touches. Finally, a bilateral filter is employed to smooth the fused depth map. Fig. 2 shows an example of the procedure.
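As an illustrative sketch of the motion analysis step (not taken from the authors' implementation), the corner-block SAD test can be written as follows; the block size and the per-pixel decision threshold are assumptions, since the paper does not report them.

    import numpy as np

    def camera_in_motion(prev, curr, block=16, thresh=8.0):
        # prev, curr: consecutive grayscale frames as 2-D numpy arrays.
        # block: side length of each corner block (assumed).
        # thresh: mean absolute difference per pixel above which the
        #         camera is judged to be moving (assumed).
        h, w = curr.shape
        corners = [(0, 0), (0, w - block), (h - block, 0), (h - block, w - block)]
        sad = 0.0
        for y, x in corners:
            a = prev[y:y + block, x:x + block].astype(np.float64)
            b = curr[y:y + block, x:x + block].astype(np.float64)
            sad += np.abs(a - b).sum()
        # Normalize the total SAD by the number of compared pixels.
        return sad / (4 * block * block) > thresh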

2.1. Background depth generation

A depth map of the background in a frame is obtained by using geometric perspective and sky detection. Geometric perspective refers to the fact that lines that are parallel in the real world, such as the edges of buildings, roads, and hallways, appear to converge to a point in the image. Such parallel lines are called vanishing lines, and the convergence point is called the vanishing point. Along the vanishing lines, the distance increases gradually. In this paper, we use the method of [15] for edge detection and the Hough transform to find vanishing lines in a frame, and a number of intersections of these vanishing lines can thus be obtained. We cluster these intersections by K-means with K = 3 and select the cluster center that contains the most intersections as the vanishing point.

Fig. 3. Flow chart of the proposed GME method.

Then the depth hypothesis of the frame is determined from five depth hypotheses [3, 4] according to the position of the vanishing point. If no vanishing point can be obtained, the bottom-up depth hypothesis is selected [4]. After segmenting the frame by mean shift [16], the depth of each segmented region is assigned the hypothesis depth value. Finally, sky detection is employed to refine the obtained depth map. The sky detector is an SVM with the RGB values and the vertical coordinate of a pixel as its input features. If a region is detected as sky, the depth of this region is assigned the farthest depth value.
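The vanishing point step can be sketched roughly as follows, assuming OpenCV is available; the Canny edge detector and the Hough/K-means parameters are illustrative stand-ins, since the paper uses the edge detector of [15] and does not list parameter values.

    import numpy as np
    import cv2

    def estimate_vanishing_point(gray):
        # Detect edges, find straight lines with the Hough transform,
        # intersect line pairs, cluster the intersections with K-means
        # (K = 3), and return the center of the densest cluster.
        edges = cv2.Canny(gray, 50, 150)
        lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold=120)
        if lines is None:
            return None  # caller falls back to the bottom-up depth hypothesis [4]
        pts = []
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                (r1, t1), (r2, t2) = lines[i][0], lines[j][0]
                A = np.array([[np.cos(t1), np.sin(t1)],
                              [np.cos(t2), np.sin(t2)]])
                if abs(np.linalg.det(A)) < 1e-6:
                    continue  # nearly parallel lines have no stable intersection
                pts.append(np.linalg.solve(A, np.array([r1, r2])))
        if len(pts) < 3:
            return None
        pts = np.float32(pts)
        criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 50, 1e-3)
        _, labels, centers = cv2.kmeans(pts, 3, None, criteria, 5,
                                        cv2.KMEANS_PP_CENTERS)
        # The cluster containing the most intersections gives the vanishing point.
        best = np.argmax(np.bincount(labels.ravel(), minlength=3))
        return centers[best]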

2.2. Proposed GME method

The GME method for complex camera movement is the core of our proposed approach. Fig. 3 shows the flow chart of the proposed GME method.

In order to reduce computational complexity, our GME method adopts a three-level frame pyramid constructed by downsampling [12, 17]. As shown in Fig. 3, the GM parameter estimation is performed from the coarsest (bottom) level to the finest (top) level. At each level, outlier rejection is first performed and the remaining pixels are then used to optimize a GME model. At the coarsest level, we set pixels at the boundary and center of the frame as the initial outliers, and perform the optimization twice in order to acquire good estimates of the GM parameters, which are projected to the next level as initial parameters. At the top level, the final GM parameters of the current frame are obtained and used as initial parameters for the next frame to speed up the optimization.

2.2.1. GME model

We use the following six-parameter affine model to describe global motion:

x′_i = a_0 x_i + a_1 y_i + a_2,
y′_i = a_3 x_i + a_4 y_i + a_5,    (1)

where (x_i, y_i) is the coordinate of the i-th pixel in the current frame I, and (x′_i, y′_i) is the coordinate of the corresponding pixel in the previous frame I′. a = (a_0, a_1, a_2, a_3, a_4, a_5)^T is the vector of global motion parameters to be solved. The projection from the lower level to the higher one is performed by multiplying the translation parameters a_2 and a_5 by 2 and keeping the remaining parameters unchanged.
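A small sketch of how the affine model (1) is applied and how parameters are projected between pyramid levels (both direct consequences of the text above; the function names are our own):

    import numpy as np

    def warp_coords(a, xs, ys):
        # Map pixel coordinates of the current frame I into the previous
        # frame I' with the six-parameter affine model of Eq. (1).
        xp = a[0] * xs + a[1] * ys + a[2]
        yp = a[3] * xs + a[4] * ys + a[5]
        return xp, yp

    def project_params_up(a):
        # Projection from a coarser pyramid level to the next finer one:
        # double the translations a2 and a5, keep the other parameters.
        a = np.asarray(a, dtype=np.float64).copy()
        a[2] *= 2.0
        a[5] *= 2.0
        return a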

Given the global motion model, we use an objective function based on minimizing the residual errors of intensity:

E(a) = (1/2) Σ_{i=1}^{N} [I′(x′_i, y′_i) − I(x_i, y_i)]^2,    (2)

where N is the number of selected pixels in E(a) and r_i = I′(x′_i, y′_i) − I(x_i, y_i) is the residual of the i-th pixel.

To estimate the global motion parameters a, we use the Levenberg-Marquardt method, a damped Gauss-Newton method, to minimize E(a). We define

r(a) = (r_1, r_2, ..., r_N)^T.    (3)

We rewrite (2) as

E(a) = (1/2) Σ_{i=1}^{N} [I′(x′_i, y′_i) − I(x_i, y_i)]^2 = (1/2) r^T(a) r(a).    (4)

(4) can be minimized by the Levenberg-Marquardt method, which converges to a local minimum more reliably than the Gauss-Newton method.
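A minimal sketch of this minimization, assuming grayscale floating-point frames, bilinear sampling from SciPy, and the selected pixel coordinates of Section 2.2.2; the authors' solver details are not given, so the use of scipy.optimize.least_squares with method='lm' is our own choice.

    import numpy as np
    from scipy.optimize import least_squares
    from scipy.ndimage import map_coordinates

    def residuals(a, I_prev, I_curr, xs, ys):
        # Intensity residuals r_i of Eq. (2) at the selected pixels (xs, ys).
        xp = a[0] * xs + a[1] * ys + a[2]
        yp = a[3] * xs + a[4] * ys + a[5]
        # Bilinear sampling of the previous frame at the warped positions.
        I_warp = map_coordinates(I_prev, [yp, xp], order=1, mode='nearest')
        return I_warp - I_curr[ys.astype(int), xs.astype(int)]

    def estimate_gm(I_prev, I_curr, xs, ys, a0=None):
        # Minimize E(a) = 0.5 * r(a)^T r(a) with Levenberg-Marquardt.
        if a0 is None:
            a0 = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])  # identity motion
        res = least_squares(residuals, a0, args=(I_prev, I_curr, xs, ys),
                            method='lm')
        return res.x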

2.2.2. Pixel selection

In some existing methods such as [12], all the pixels in a frame take part in the optimization of the objective function. However, we notice that the residuals of pixels on the edges of a uniform region tend to change more than those of pixels inside the region during each iteration of the optimization, and a large intensity gradient helps the GME method find the local minimum. Therefore, we only choose pixels with large gradients, such as edge pixels, to take part in the optimization. This also reduces computational complexity because fewer pixels are evaluated in (2).

In addition, if the selected pixels gather in a small region, their motion can hardly represent the motion of the whole background, which leads to wrong GM parameter estimation. To avoid such a case, we divide the image into blocks and only select the pixels with the θ% largest gradients in each block.
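A sketch of this per-block selection, assuming a simple finite-difference gradient; the paper does not specify the gradient operator or the block size used here.

    import numpy as np

    def select_pixels(gray, block=16, theta=0.25):
        # Keep the pixels whose gradient magnitude is in the top theta
        # fraction within each block, so the selection spreads over the frame.
        gy, gx = np.gradient(gray.astype(np.float64))
        mag = np.hypot(gx, gy)
        h, w = gray.shape
        xs, ys = [], []
        for y0 in range(0, h - block + 1, block):
            for x0 in range(0, w - block + 1, block):
                patch = mag[y0:y0 + block, x0:x0 + block]
                k = max(1, int(theta * patch.size))
                thresh = np.partition(patch.ravel(), -k)[-k]
                yy, xx = np.nonzero(patch >= thresh)
                ys.append(yy + y0)
                xs.append(xx + x0)
        return np.concatenate(xs), np.concatenate(ys)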

2.2.3. Outlier rejection

Given the GM parameters of the current frame I, the difference frame D between I and I′ can be obtained by motion-compensated prediction. Pixels with distinctly large values in D (e.g., the top ρ% largest values) can be considered as outliers.

Fig. 4. Comparison of extracted moving objects on sequences Stefan (frame 078) and Coastguard (frame 170) using RAN-LS, BLOC-RES, and the proposed method, respectively.

We construct an outlier rejection mask B, and suppose that B is composed of a set of non-overlapping blocks of size SIZE × SIZE:

B = ∪_{m=1}^{M} b_m,   b_m ∩ b_n = ∅,  m ≠ n,  m, n ∈ {1, 2, ..., M},    (5)

where b_m is a block, ∅ is the empty set, and M is the number of blocks. B is divided into B_O and B_I, the sets of outlier and inlier blocks, respectively. We denote the set of all boundary blocks of a frame as B_B, and the sets of all selected pixels (mentioned in Section 2.2.2) in a frame and in block b_m as S and S_{b_m} respectively, with S = {S_{b_1}, S_{b_2}, ..., S_{b_M}}. Furthermore, S_{b_m} is divided into S_{b_m}^{>T} and S_{b_m}^{≤T} (i.e., S_{b_m} = S_{b_m}^{>T} ∪ S_{b_m}^{≤T}), where T is the outlier threshold, chosen so that the top ρ% of residual values exceed it.

Block b_m is classified into the set of candidate outlier blocks B_O by the following rule:

b_m ∈ B_O,  if β |S_{b_m}^{>T}| > |S_{b_m}|,    (6)

where β is a positive integer and |·| denotes the number of pixels in a set. Outlier rejection is performed on these candidate outlier blocks, but is not limited to them. We propose three rules to reject outliers.

Rule 1: A candidate block is an outlier block if it is a boundary block [12], i.e.,

b_m ∈ B_O,  if b_m ∈ B_B.    (7)

Rule 2: A block is an outlier block if it is not a boundary block but has no fewer than T_ob candidate blocks in its eight-neighborhood, i.e.,

B_{b_m}^{bl} = {b_l : b_l is a candidate block in the eight-neighborhood of b_m},    (8)

b_m ∈ B_O,  if |B_{b_m}^{bl}| ≥ T_ob.    (9)

In [12], the outlier blocks are limited to the candidate blocks with large residuals. However, moving objects may also contain outliers with small residuals, and our Rule 2 can reject such outliers.

Rule 3: After the first two rules are applied, the pixels of inlier blocks are checked. A pixel is an outlier if its residual is larger than T. This kind of outlier may be caused by noise.

By combining the three rules to reject outliers, we obtain the set of inlier pixels S_E:

S_E = {i | (i ∈ b_m) ∧ (b_m ∈ B_I) ∧ (i ∈ S_{b_m}^{≤T})}.    (10)

Only pixels in S_E participate in the optimization of the objective function (2).
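The three rules can be put together as in the following sketch, which operates on the absolute residual frame D and the selection mask of Section 2.2.2; parameter defaults follow the values reported in Section 3.1, while the helper structure and the exact definition of the threshold T are assumptions.

    import numpy as np

    def inlier_pixel_mask(D, sel_mask, size=16, rho=0.20, beta=4, t_ob=2):
        # Returns a boolean mask approximating the inlier pixel set S_E of Eq. (10).
        h, w = D.shape
        # T: residual value exceeded by roughly the top rho fraction of selected pixels.
        T = np.quantile(D[sel_mask], 1.0 - rho)
        above = (D > T) & sel_mask

        nby, nbx = h // size, w // size
        candidate = np.zeros((nby, nbx), dtype=bool)
        for by in range(nby):
            for bx in range(nbx):
                ys = slice(by * size, (by + 1) * size)
                xs = slice(bx * size, (bx + 1) * size)
                s = int(sel_mask[ys, xs].sum())
                a = int(above[ys, xs].sum())
                candidate[by, bx] = s > 0 and beta * a > s        # Eq. (6)

        outlier = np.zeros_like(candidate)
        boundary = np.zeros_like(candidate)
        boundary[0, :] = boundary[-1, :] = boundary[:, 0] = boundary[:, -1] = True
        outlier |= candidate & boundary                           # Rule 1
        for by in range(1, nby - 1):                              # Rule 2
            for bx in range(1, nbx - 1):
                nbrs = int(candidate[by-1:by+2, bx-1:bx+2].sum()) - int(candidate[by, bx])
                if nbrs >= t_ob:
                    outlier[by, bx] = True

        inlier_blocks = np.ones((h, w), dtype=bool)
        for by in range(nby):
            for bx in range(nbx):
                if outlier[by, bx]:
                    inlier_blocks[by*size:(by+1)*size, bx*size:(bx+1)*size] = False
        # Rule 3: inside inlier blocks, drop selected pixels whose residual exceeds T.
        return sel_mask & inlier_blocks & (D <= T)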

2.2.4. Moving objects extraction

When the camera is in motion, we obtain the GM parameters of the current frame using the method above. The residual frame between the current frame and the motion-compensated previous frame can then be obtained. Since the residual frame now only contains local motions, moving objects can be detected by binarizing the residual frame with a threshold. Morphological operations, including closing and hole filling, are then applied to refine the extracted moving objects. Some results can be seen in Fig. 4. When the camera is static, the motion segmentation method proposed in [11] is used to extract moving objects.
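A brief sketch of this step, using SciPy's morphology routines; the binarization threshold (mean plus two standard deviations of the residual) and the structuring-element size are assumptions, as the paper does not specify them.

    import numpy as np
    from scipy import ndimage

    def extract_moving_objects(residual, thresh=None):
        # Binarize the motion-compensated residual frame, then clean the
        # mask with a morphological closing and hole filling.
        if thresh is None:
            thresh = residual.mean() + 2.0 * residual.std()  # assumed rule
        mask = residual > thresh
        mask = ndimage.binary_closing(mask, structure=np.ones((5, 5)))
        mask = ndimage.binary_fill_holes(mask)
        return mask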

3. EXPERIMENTS AND RESULTS

3.1. Performance of our proposed GME method

We compare our proposed GME method with the methods of [12, 13].

Fig. 5. MAE comparison of CAS-GD, GD, RAN-LS, BLOC-RES and our method on sequences Stefan and Coastguard.

Table 1. PSNR comparison (dB).

Sequence     CAS-GD   GD      RAN-LS   Proposed
Stefan       24.02    23.83   24.21    26.32
Coastguard   26.78    26.56   26.98    28.15
City         29.48    28.71   29.54    31.46
Tempete      27.82    26.51   27.84    28.17
Waterfall    34.86    34.7    35.5     36.04
Foreman      26.32    27.69   27.73    30.22

In [13], the authors compare their method (CAS-GD) with four other methods, FLT-GD, GD, RAN-LS, and LSS-ME, all based on minimization of motion vectors. Our method outperforms all five of them; here we only show the results of CAS-GD, GD, and RAN-LS. [12] is an algorithm based on the sum of block residuals (BLOC-RES). Experiments are carried out on the standard test sequences Stefan, Coastguard, City, Tempete, Waterfall, and Foreman in CIF resolution (352×288).

Algorithm parameters: SIZE = 16, θ% = 25%, ρ% = 20%, Tob = 2, β = 4.

Subjective results: The sequences Stefan and Coastguard, which have fast camera motion, are chosen to test the performance of the moving object extraction algorithm. We compare our method only with BLOC-RES and RAN-LS because RAN-LS outperforms CAS-GD and GD on these two sequences. Some example results are shown in Fig. 4, where the result of BLOC-RES is from the original paper [12], and the result of RAN-LS is obtained with the MATLAB code provided in [13]. From Fig. 4, we can see that our proposed method performs better than the other methods: it produces less noise outside the extracted moving objects and provides more complete moving objects (see the legs in Stefan and the contours of the boat in Coastguard).

Objective results: Two criteria are used to evaluate the proposed GME algorithm: the mean absolute error (MAE) [12] and the peak signal-to-noise ratio (PSNR).

The MAE reflects the precision of the estimated camera parameters, and the PSNR represents the image quality after global motion compensation. From Fig. 5 we can see that the MAE of the proposed method is generally lower than those of CAS-GD, RAN-LS, GD, and BLOC-RES. Table 1 lists the average PSNR (in dB) on six sequences; the first 100 frames of each sequence are used to compute the average. As can be seen from the table, the proposed algorithm performs better than CAS-GD, RAN-LS, and GD. Parameter deviations (within a reasonable range) have little influence on the performance of these algorithms, so the parameter selection is not critical.
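For completeness, the two criteria can be computed from a frame and its motion-compensated prediction as follows; these are the standard definitions for 8-bit images, not code taken from the paper.

    import numpy as np

    def mae(pred, target):
        # Mean absolute error between the compensated prediction and the frame.
        return np.abs(pred.astype(np.float64) - target.astype(np.float64)).mean()

    def psnr(pred, target, peak=255.0):
        # Peak signal-to-noise ratio in dB.
        mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
        return 10.0 * np.log10(peak ** 2 / mse)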

3.2. Results of recovered depth maps

We are mainly concerned with the performance of our method when both the camera and the objects are in motion. Because few methods consider this situation, we compare our method with [11, 8, 9], which can handle camera movement to some extent. The results of [11] are produced by its public MATLAB code, and the Make3D database [10] (400 outdoor images with ground truth) is used as the reference dataset since our test sequences are all outdoor sequences. The results of [8] are produced by its released software (http://www.zjucvg.net/acts/acts.html). For [9], we implemented its code ourselves. Four sequences with different motions are used to evaluate these methods: Stefan and three video clips from the movie "Forrest Gump". Stefan shows a tennis player with fast camera motion. Gump1 shows a mother holding her child's hand and walking along a road with a translating camera. Gump2 shows a man driving a small car while the camera moves upward. Gump3 shows a running man with a slowly translating camera.

From the results shown in Fig. 6, we can see that the depth maps recovered by our method are much better than those of the other methods. The poor results of [11] demonstrate that a method based on machine learning requires the reference dataset to cover images similar to the test images in order to produce decent results. In Gump2, the method of [8] fails to produce depth maps with its own software.

Fig. 6. Comparisons of depth on four sequences. Images from left to right are the test images, the depth of [11], the depth of [8], the depth of [9], and the depth and 3D images of the proposed approach. Darker pixels indicate objects farther away.

When moving objects occupy a significant part of the frame, as in Stefan and Gump1, the results of [8] are all wrong because the large amount of object motion causes wrong estimation of the camera parameters in SFM. The method of [9] also produces poor results, especially for the depth of the moving objects, because object motion greatly decreases the accuracy of motion estimation when the camera is moving. None of the three methods can recover the depth of moving objects on these sequences.

4. CONCLUSION

To automatically recover depth maps from a monocular video, we propose a method combining geometric information and moving object extraction, which can be applied to both static and moving cameras. The geometric information is used to recover the depth of the background, and the moving object extraction is used to refine the depth map. The most important part of our work is moving object extraction under complex camera movement; most existing methods cannot recover accurate depth when both the camera and the objects are in motion. However, the proposed approach may give wrong background depth when the scene does not contain obvious geometric information, and it sometimes produces temporally unstable moving objects. These problems will be studied in the future.

5. REFERENCES

[1] R. Phan, R. Rzeszutek, and D. Androutsos, "Semi-automatic 2d to 3d image conversion using scale-space random walks and a graph cuts based depth prior," in ICIP, 2011.

[2] W. Z. Yang, G. F. Zhang, H. J. Bao, J. Kim, and H. Y. Lee, "Consistent depth maps recovery from a trinocular video sequence," in CVPR, 2012.

[3] S. Battiato, S. Curti, M. La Cascia, M. Tortora, and E. Scordato, "Depth map generation by image classification," in Electronic Imaging, 2004.

[4] C. C. Cheng, C. T. Li, and L. G. Chen, "A 2d-to-3d conversion system using edge information," in ICCE, 2010.

[5] Y. L. Chang, C. Y. Fang, L. F. Ding, S. Y. Chen, and L. G. Chen, "Depth map generation for 2d-to-3d conversion by short-term motion assisted color segmentation," in ICME, 2007.

[6] W. J. Tam and L. Zhang, "3d-tv content generation: 2d-to-3d conversion," in ICME, 2006.

[7] T. Li, Q. H. Dai, and X. D. Xie, "An efficient method for automatic stereoscopic conversion," IET, 2008.

[8] G. F. Zhang, J. Y. Jia, T. T. Wong, and H. J. Bao, "Consistent depth maps recovery from a video sequence," TPAMI, 2009.

[9] D. Kim, D. Min, and K. Sohn, "A stereoscopic video generation method using stereoscopic display characterization and motion analysis," IEEE Transactions on Broadcasting, 2008.

[10] A. Saxena, M. Sun, and A. Y. Ng, "Make3d: Learning 3d scene structure from a single still image," TPAMI, 2009.

[11] K. Karsch, C. Liu, and S. B. Kang, "Depth extraction from video using non-parametric sampling," in ECCV, 2012.

[12] B. Qi, M. Ghazal, and A. Amer, "Robust global motion estimation oriented to video object segmentation," TIP, 2008.

[13] Y. M. Chen and I. V. Bajic, "Motion vector outlier rejection cascade for global motion estimation," IEEE SPL, 2010.

[14] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge Univ. Press, 2000.

[15] P. Meer and B. Georgescu, "Edge detection with embedded confidence," TPAMI, 2001.

[16] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," TPAMI, 2002.

[17] F. Dufaux and J. Konrad, "Efficient, robust, and fast global motion estimation for video coding," TIP, 2000.