
Int J Comput Vis (2012) 97:104–121, DOI 10.1007/s11263-011-0471-x

Automatic Real-Time Video Matting Using Time-of-Flight Camera and Multichannel Poisson Equations

Liang Wang · Minglun Gong · Chenxi Zhang · Ruigang Yang · Cha Zhang · Yee-Hong Yang

Received: 9 September 2010 / Accepted: 23 May 2011 / Published online: 15 June 2011
© Springer Science+Business Media, LLC 2011

Abstract This paper presents an automatic real-time video matting system. The proposed system consists of two novel components. In order to automatically generate trimaps for live videos, we advocate a Time-of-Flight (TOF) camera-based approach to video bilayer segmentation. Our algorithm combines color and depth cues in a probabilistic fusion framework. The scene depth information returned by the TOF camera is less sensitive to environment changes, which makes our method robust to illumination variation, dynamic background and camera motion. For the second step, we perform alpha matting based on the segmentation result. Our matting algorithm uses a set of novel Poisson equations that are derived for handling multichannel color vectors, as well as the depth information captured. Real-time processing speed is achieved through optimizing the algorithm for parallel processing on graphics hardware. We demonstrate the effectiveness of our matting system on an extensive set of experimental results.

Electronic supplementary material The online version of this article (doi:10.1007/s11263-011-0471-x) contains supplementary material, which is available to authorized users.

L. Wang (✉) · C. Zhang · R. Yang
Department of Computer Science, University of Kentucky, 329 Rose Street, Lexington, KY, 40506, USA
e-mail: [email protected]

C. Zhang
e-mail: [email protected]

R. Yang
e-mail: [email protected]

M. Gong (✉)
Department of Computer Science, Memorial University of Newfoundland, St. John's, NL, A1B 3X5, Canada
e-mail: [email protected]

C. Zhang
Microsoft Research, Redmond, One Microsoft Way, Redmond, WA 98052, USA
e-mail: [email protected]

Y.-H. Yang
Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
e-mail: [email protected]

Keywords Bilayer segmentation · Video matting · Time-of-flight camera

1 Introduction

Digital matting or "pulling a matte" refers to the process of extracting a foreground object from still images or video sequences and estimating the opacity (alpha value) for each pixel covered by the object. This operation serves as an important tool in image and video editing applications, giving users the ability to composite extracted objects seamlessly into a novel background.

In image composition, an observed image I(x, y) can be modeled as a linear blending of a foreground image F(x, y) and a background image B(x, y) with its alpha matte α(x, y) ∈ [0, 1] by the compositing equation (Porter and Duff 1984) ((x, y) arguments omitted for clarity)

I = αF + (1 − α)B. (1)

Fig. 1 Overview of our automatic real-time matting system. A two-phase process is performed to pull high-quality mattes for live videos automatically and in real-time

On the other hand, matting algorithms aim at estimating F, B and α from a single input image I. Matting is thus inherently an under-constrained problem because, for each pixel location, three unknowns need to be determined from the observed pixel value. Most matting approaches rely on prior assumptions and user guidance to constrain the problem. For instance, in blue screen matting (Mishima 1993), a background with constant color is required and B is therefore known. Although blue screen matting is still the predominant technique in the film industry, the special setup is rarely available to ordinary users. In natural image matting, where no prior assumption on the scene background is made, user interaction is essential to obtain a good separation. Nearly all existing methods include users in the loop to obtain additional constraints, limiting their applications to still image or offline video editing (Wang et al. 2005; Wang and Cohen 2007b; Li et al. 2005). In this paper, we address the problem of performing automatic real-time matting for live videos. As an extension of still image matting, video matting is a more widely applied technique in commercial television, film production, and telepresence. One of the typical applications is video teleconferencing, where the background content can be replaced by other images or videos. Such a live background substitution module is both fun and aesthetically pleasing, and it can protect the privacy of the users. Compared with still image or offline video matting, automatically pulling an alpha matte and dynamic foreground objects from video sequences in real-time is even more challenging. The main challenges are two-fold:

• Automatic trimap generation. In order to disambiguate the ill-posed problem, almost all matting algorithms require the input image to be segmented into three regions: definite foreground, definite background and unknown. The problem is thus simplified to estimating F, B and α for pixels within the unknown region. This tri-layer segmentation map is referred to as a trimap. Automatic and robust trimap generation is critical to the success of high-quality video matting (McGuire et al. 2005; Sun et al. 2004; Wang and Cohen 2007b).

For most video matting systems, a common solution to automate the trimap generation is to start with a binary classification (Pham et al. 2009). The α values are constrained to be 1 or 0 and pixels are labeled as foreground or background, respectively. The layer boundaries are then dilated to form a narrow band which serves as the unknown region for matting. Unfortunately, robust bilayer segmentation on live videos is challenging in real world scenarios. In a typical office environment, illumination may change dramatically due to the light being turned on and off. The background appearance may vary from time to time when there are people moving around. Additionally, camera shaking or gradual camera movement can happen occasionally in many applications (Sun et al. 2006).

• Real-time matte estimation. Recent developments in digital matting have significantly advanced the state of the art in terms of quality. However, in terms of speed, most of the best automatic matting algorithms typically take from several seconds to several minutes to process a single image. Due to the high computational cost and manual intervention involved, most of the existing matting systems work in an offline mode and thus are not well suited for processing live videos. So far, real-time video matting for dynamic scenes can be achieved under studio settings using specially designed optical devices; however, such techniques are restricted to controlled studio environments.

In this paper we present a novel video matting system that operates in real-time while still pulling high-quality mattes for live video sequences. Figure 1 gives the overview of the whole process, which uses both the color and depth images as input. The system adopts a two-phase process by first performing bilayer segmentation to generate initial trimaps for video frames, then applying a real-time matting algorithm to estimate the final mattes.

For the first step of the method, we advocate a Time-of-Flight (TOF) camera-based approach to bilayer video segmentation. We are motivated by the fact that the depth cue is less sensitive to environment changes compared to traditional appearance or motion cues. In addition to the color information obtained from a video camera, we utilize a TOF camera to sense the scene depth during capture and propose an effective segmentation algorithm which combines color and depth cues into a unified framework and adjusts their relative importance adaptively over time to achieve improved robustness.

For the second step, a novel matting algorithm is presented to achieve high-quality real-time matte estimation. The algorithm is based on a set of novel Poisson equations that are derived for handling multichannel color vectors, as well as the depth information returned by the TOF camera. Real-time processing speed is achieved through optimizing the algorithm for parallel processing on Graphics Processing Units (GPUs). The quantitative evaluation on still images shows that our method is comparable to state-of-the-art offline image matting techniques.

This paper builds upon and extends our recent work in (Wang et al. 2010; Gong et al. 2010), with a more detailed description of the algorithm and additional evaluation results. The rest of the paper is organized as follows: After reviewing the related work in Sect. 2, we introduce the TOF camera-based binary segmentation algorithm in Sect. 3. In Sect. 4, we present the derivation of the multichannel Poisson equations and our real-time matting algorithm. Implementation details of our matting system are reported in Sect. 5. In Sect. 6 we report experimental results. We discuss limitations and future work in Sect. 7 and conclude in Sect. 8.

2 Related work

This paper is related to a sizable body of literature on image and video foreground extraction. Here we limit the discussion to video bilayer segmentation and matting approaches which are most relevant to our work.

2.1 Bilayer segmentation

Bilayer segmentation of live video has long been an active research topic (Blake et al. 2004; Criminisi et al. 2006; Kolmogorov et al. 2005; Rother et al. 2004; Sun et al. 2006, 2007; Yin et al. 2007; Yu et al. 2007). People usually make various assumptions to simplify the problem. For instance, in Criminisi et al. (2006), Harville et al. (2001), Sun et al. (2006), Yin et al. (2007), the camera is assumed to be stationary and the background appearance is either previously known or near static. These assumptions may be acceptable in personal offices, but can be invalidated in meeting rooms, shared labs, etc. Recently, Zhang et al. (2011) propose to use a structure from motion algorithm to estimate depth information, hence they are able to handle dynamic scenes and moving cameras. Nevertheless, their method is an offline approach and the relatively long processing time (several minutes per frame as reported in the paper) makes it impractical for live video segmentation. In Gordon et al. (1999), Harville et al. (2001), Kolmogorov et al. (2005), stereo cameras are deployed to compute the scene depth, and the fusion of color and depth information leads to improved segmentation results compared to using the color cue alone. On the other hand, passive stereo matching is prone to errors under low lighting environments or when the scene contains large textureless regions like a white wall.

2.2 Video Matting

A variety of techniques have been developed in the past decade for still image matting (Bai and Sapiro 2007; Chuang et al. 2001; Gong and Yang 2009; Grady et al. 2005; Levin et al. 2008; Sun et al. 2004; Wang and Cohen 2005, 2007a). Video matting is pioneered by Chuang et al. (2002). In their Bayesian video matting approach, users are required to manually specify trimaps for some key frames. These trimaps are then propagated to all frames using the estimated bidirectional optical flows. Finally the alpha matte for each frame is calculated independently using Bayesian matting (Chuang et al. 2001). Trimaps are generated from binary segmentations in two video object cutout approaches (Li et al. 2005; Wang et al. 2005). Individual frames are over-segmented into homogenous regions, based on which a 3D graph is constructed. The optimal cut that separates foreground and background regions is found using 3D graph cuts. Pixels within a narrow band of the optimal cut are labeled as unknown regions, with their alpha values estimated using image matting techniques.

The above approaches are all designed to handle pre-captured video sequences offline and utilize temporal coherence for improved accuracy. Our approach, on the contrary, is designed for handling live captured videos in real-time. When processing a given frame, only the previous frames are available to us.

There are several online video matting techniques available. For example, in defocus matting (McGuire et al. 2005), the scene is captured using multiple optically aligned cameras with different focus/aperture settings. The trimap is automatically generated based on the focused regions of the captured images. The alpha matte is calculated by solving an error minimization problem, which takes several minutes per frame. Automatic video matting can also be achieved using a camera array (Joshi et al. 2006). The captured images are aligned so that the variance of pixels re-projected from the foreground is minimized and the variance of pixels re-projected from the background is maximized. The alpha values are calculated using a variance-based matting equation. The computational cost is linear with respect to the number of cameras and near-real-time processing speed is achieved.

Real-time video matting was first achieved in studio settings (McGuire et al. 2006). In their approach, the background screen is illuminated with polarized light and the scene is captured by two cameras, each with a different polarizing filter. Since the background has different colors in the two captured images, simple blue screen matting can be applied to extract the alpha matte in real-time. Real-time video matting for natural scenes has been made possible only recently. Gastal and Oliveira (2010) propose to accelerate the foreground/background sampling process used in the robust matting algorithm initially proposed by Wang and Cohen (2007a). Based on the observation that pixels within a small neighborhood tend to share similar attributes, they suggest sharing the foreground/background samples found among neighboring pixels, which helps to reduce the amount of per-pixel computation needed.

Our work is closely related to the recent paper by Pham et al. (2009). In their matting system, trimaps are generated using a bilayer segmentation algorithm that is based on the color cue only. To enable real-time processing, Bayesian matting (Chuang et al. 2001) is modified by using a down-sampling approximation. Compared to Pham et al. (2009), although our work also relies on bilayer segmentation, the TOF camera employed in our system allows us to fuse color and depth information. Thanks to the depth cue, which is invariant to illumination variations and motion ambiguity, our method is better at handling challenging scenarios. Regarding matting accuracy, according to the authors' report, their modified Bayesian matting is less accurate than conventional Bayesian matting (Chuang et al. 2001) due to the approximation employed. In contrast, our matting algorithm is based on new multichannel Poisson equations, and quantitative comparison using ground truth data shows that our method outperforms Chuang et al. (2001) in both accuracy and speed.

2.3 TOF Camera-Based Foreground Extraction

Recently, Time-of-Flight (TOF) cameras have started to attract the attention of many vision researchers. TOF cameras are active sensors that determine the per-pixel depth value by measuring the time taken by infrared light to travel to the object and back to the camera. These sensors are currently available from companies such as 3DV Systems (3DV Systems), Canesta (Canesta Inc.) and Mesa Imaging (MESA Imaging AG) at commodity prices. So far TOF cameras have not been widely used in video segmentation applications. Existing TOF camera-based matting algorithms (Crabb et al. 2008; Wang et al. 2007b; Wu et al. 2008; Zhu et al. 2009) directly take the depth image from the TOF camera and threshold it to compute a foreground mask for trimap generation. These local approaches are simple but less robust. Because TOF cameras are characterized by independent pixel depth estimates, the measured depth map can be noisy both spatially and temporally. Such noise is content dependent and hence difficult to remove by typical filtering methods. More critically, when there are background objects that are close to the foreground layer, depth thresholding can lead to incorrect segmentation. Given the fact that a good trimap is essential for matting, these approaches have difficulty handling challenging scenarios. In addition, to the best of our knowledge, nearly all existing TOF camera-based matting algorithms cannot achieve real-time processing speed, which limits their applications to offline video processing.

3 Bilayer Segmentation Using a TOF Camera

In this section, we present the first half of our matting system, the bilayer segmentation step. To achieve robust segmentation, our method combines color and depth cues in a unified probabilistic fusion framework. Noticing that color and depth may have different discriminative power during different time periods, we propose a novel fusion method which is able to adjust the influence of the color and depth cues adaptively over time.

3.1 Problem Formulation

Let I^t be the RGB color image at the current instance t that is to be processed, and D^t be its corresponding depth map returned by the TOF camera (as shown in Fig. 1). Let Ω be the set of all pixels in I^t. The color and depth values of pixel r ∈ Ω are denoted as I^t_r and D^t_r, respectively. For notation clarity we assume I^t_r and D^t_r are color and depth measurements of the same scene point in 3D. In practice, for most TOF cameras, the optical centers of the color sensor and the depth sensor are different but very close to each other, and the depth map can be warped to align with the color image (Yang et al. 2007). In the following, when there is no confusion, we will omit the superscript t for conciseness.

Following the general framework in Boykov and Jolly (2001), we formulate bilayer segmentation as a binary labeling problem. More specifically, a labeling function f assigns each pixel a unique binary label α_r ∈ {0 (background), 1 (foreground)}. The optimal labeling can be obtained by minimizing the energy of the form:

E(f) = Σ_{r∈Ω} U(α_r) + λ Σ_{(r,s)∈ξ} V(α_r, α_s),   (2)


where Σ U(·) is the data term that evaluates the likelihood of each pixel belonging to foreground or background. The contrast term Σ V(·, ·) encodes the assumption that segmentation boundaries are inclined to align with edges of high image contrast. ξ denotes the set of edges that correlate the center pixel with its eight neighbors. λ is a strength parameter that balances the two terms. The contrast term used in our paper is defined as:

V(α_r, α_s) = |α_r − α_s| exp(−‖I_r − I_s‖² / β),   (3)

where ‖I_r − I_s‖ is the Euclidean norm of the color difference and β is chosen to be β = 2⟨‖I_r − I_s‖²⟩ (⟨·⟩ indicates expectation) (Criminisi et al. 2006; Sun et al. 2007).

The color and depth information obtained from the TOF camera is combined to form the data term. That is, Σ U(·) consists of two parts:

Σ_{r∈Ω} U(α_r) = λ_c Σ_{r∈Ω} U_c(α_r) + λ_d Σ_{r∈Ω} U_d(α_r),   (4)

where Σ U_c(·) is the color term, which models the foreground and background color likelihoods. Σ U_d(·) is the depth term that models the depth likelihood of the scene. λ_c and λ_d are two parameters that control the influences of these two terms. The optimal labeling function that minimizes the cost function (2) can be efficiently solved using the graph cuts method (Boykov et al. 2001).
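To make the construction of (2)–(4) concrete, the following Python/NumPy sketch assembles the per-pixel unary costs and the pairwise contrast weights from precomputed likelihood maps. The function and variable names (contrast_weights, data_term) are illustrative and not taken from the paper, and only the 4-neighbor edges are shown for brevity (the paper uses an 8-neighborhood); the actual minimization is left to any standard max-flow/min-cut solver, which our system runs on the CPU (Sect. 5.1).

```python
import numpy as np

def contrast_weights(I):
    """Pairwise weights of (3) for horizontal and vertical 4-neighbor edges.

    I: float image of shape (H, W, 3). Returns (w_h, w_v); w_h[y, x] links
    pixel (y, x) to (y, x+1), and w_v[y, x] links (y, x) to (y+1, x).
    """
    dx = np.sum((I[:, 1:] - I[:, :-1]) ** 2, axis=2)   # ||I_r - I_s||^2, horizontal edges
    dy = np.sum((I[1:, :] - I[:-1, :]) ** 2, axis=2)   # vertical edges
    beta = 2.0 * np.mean(np.concatenate([dx.ravel(), dy.ravel()]))  # beta = 2<||I_r - I_s||^2>
    return np.exp(-dx / beta), np.exp(-dy / beta)

def data_term(p_fg_color, p_bg_color, p_fg_depth, p_bg_depth,
              lambda_c, lambda_d, eps=1e-8):
    """Unary costs of (4): per-pixel cost of labeling alpha=1 (fg) or alpha=0 (bg).

    The p_* arrays hold per-pixel likelihoods p(I_r | alpha) and p(D_r | alpha).
    """
    cost_fg = -lambda_c * np.log(p_fg_color + eps) - lambda_d * np.log(p_fg_depth + eps)
    cost_bg = -lambda_c * np.log(p_bg_color + eps) - lambda_d * np.log(p_bg_depth + eps)
    return cost_fg, cost_bg
```

The two unary maps and the edge weights (scaled by λ) define the s-t graph of (2); the min-cut of that graph yields the binary labeling.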

3.2 Likelihood for Color

To model the likelihood of each pixel belonging to the foreground or background layer, a foreground color model p(I_r | α_r = 1) and a background color model p(I_r | α_r = 0) are learned from image data. In Blake et al. (2004), Kolmogorov et al. (2005), Sun et al. (2006, 2007), both are modeled with Gaussian Mixture Models (GMMs) and learned using Expectation Maximization (EM). However, a good initialization of the EM algorithm is difficult to obtain, and the iterative learning process of EM would slow down the system. Criminisi et al. (2006) model the color likelihoods nonparametrically as color histograms. We notice that the performance of their simplified approach is sensitive to the user-specified number of color bins. In this paper we propose to use a hybrid approach. We first construct histograms for foreground and background pixels respectively, and then build foreground/background GMMs based on the 3D color histograms.

More specifically, two 3D histograms, each with H bins in the RGB color space, are constructed for the foreground and background separately. We denote the foreground and background Gaussian components by {μ^F_1, Σ^F_1, ω^F_1}, ..., {μ^F_H, Σ^F_H, ω^F_H} and {μ^B_1, Σ^B_1, ω^B_1}, ..., {μ^B_H, Σ^B_H, ω^B_H}, respectively. Here μ is the mean color, Σ the covariance matrix assumed to be diagonal, and ω the weight of the Gaussian component. Both μ_i and Σ_i of the i-th component of the GMMs can be directly learned using the color samples in the i-th bin. The component weight ω_i is set to the value of the i-th color bin (note that Σ ω_i = 1).

Given a pixel I_r belonging to the k-th bin, the conditional probability p(I_r | α_r = 1) is computed as:

p(I_r | α_r = 1) = Σ_{i∈ℵ} ω^F_i G(I_r | μ^F_i, Σ^F_i) / Σ_{i∈ℵ} ω^F_i,   (5)

where ℵ is the index set of k's neighboring bins in 3D (likewise, p(I_r | α_r = 0) is defined in the same way). Finally, the color term is defined as

Σ_{r∈Ω} U_c(α_r) = −Σ_{r∈Ω} log p(I_r | α_r).   (6)

We found that the above scheme is both stable and sufficiently efficient for real-time implementation. Note that both the foreground/background color likelihood models are updated over successive frames based on the segmentation results of the previous frame. This continuous learning process allows us to estimate the color models more accurately according to the very recent history.
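A minimal sketch of this hybrid histogram/GMM color model, assuming 8-bit RGB samples and diagonal covariances; the neighborhood set ℵ of (5) is taken here to be the 3×3×3 block of bins around the pixel's bin. Function names (fit_color_model, color_likelihood) are illustrative, not from the paper.

```python
import numpy as np

H = 8  # bins per channel, i.e. 8^3 bins in RGB (Sect. 5.3)

def fit_color_model(pixels):
    """Learn {mu_i, Sigma_i, omega_i} for every occupied RGB bin (Sect. 3.2).

    pixels: (N, 3) uint8 samples from one layer (foreground or background).
    Returns a dict keyed by the 3D bin index.
    """
    bins = np.minimum((pixels.astype(np.int64) * H) // 256, H - 1)
    model = {}
    for key in map(tuple, np.unique(bins, axis=0)):
        mask = np.all(bins == key, axis=1)
        samples = pixels[mask].astype(np.float64)
        model[key] = (samples.mean(axis=0),          # mu_i
                      samples.var(axis=0) + 1e-3,    # diagonal Sigma_i
                      mask.sum() / len(pixels))      # omega_i (normalized histogram value)
    return model

def color_likelihood(color, model):
    """Evaluate p(I_r | alpha) of (5) using the pixel's bin and its 3D neighbors."""
    c = color.astype(np.float64)
    k = np.minimum((color.astype(np.int64) * H) // 256, H - 1)
    neighbors = {tuple(np.clip(k + np.array(dk) - 1, 0, H - 1)) for dk in np.ndindex(3, 3, 3)}
    num, den = 0.0, 0.0
    for key in neighbors:
        if key in model:
            mu, var, w = model[key]
            g = np.exp(-0.5 * np.sum((c - mu) ** 2 / var)) / np.sqrt(np.prod(2.0 * np.pi * var))
            num += w * g
            den += w
    return num / den if den > 0 else 1e-8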

3.3 Likelihood for Depth

Under ideal conditions, TOF cameras are capable of relatively accurate measurements. However, in practice the quality of the measurements is subject to many factors. The best-known problem is that the measured depth suffers from a bias that depends on object intensity. That is, dark objects will appear farther in the returned depth map compared to their actual depth w.r.t. the camera. This depth bias will cause dark foreground regions to occasionally be labeled as background and result in "flickering" artifacts. In previous literature, researchers usually compensate for this bias through a laborious photometric calibration step (Davis and Gonzalesz-Banos 2003; Zhu et al. 2008). In order to alleviate this bias without resorting to pre-calibration, we take the intensity bias into consideration when building the foreground/background depth models.

The foreground/background depth likelihoods are modeled as Gaussian distributions. Pixels from the latest segmented frame are first classified into dark and bright samples based on an intensity threshold T_dark. For each foreground or background model, two Gaussian distributions are learned using the dark and bright sample sets, respectively. Let {χ^F, ν^F} and {χ'^F, ν'^F} represent the two Gaussian components of the foreground depth model; then the conditional probability p(D_r | α_r = 1) is:

p(D_r | α_r = 1) = G(D_r | χ^F, ν^F) if I_r < T_dark, and G(D_r | χ'^F, ν'^F) otherwise.   (7)

Similarly, p(D_r | α_r = 0) can be defined using the corresponding Gaussian models {χ^B, ν^B}, {χ'^B, ν'^B} of the background. The depth term can then be written as:

Σ_{r∈Ω} U_d(α_r) = −Σ_{r∈Ω} log p(D_r | α_r).   (8)
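The intensity-aware depth likelihood of (7) reduces to picking one of two Gaussians per pixel. A sketch, assuming 8-bit depth and intensity values and the T_dark threshold of Sect. 5.3; names are illustrative.

```python
import numpy as np

T_DARK = 60  # intensity threshold separating dark and bright pixels (Sect. 5.3)

def fit_depth_model(depth, intensity, layer_mask):
    """Fit the two Gaussians {chi, nu} and {chi', nu'} of one layer's depth model."""
    dark = layer_mask & (intensity < T_DARK)
    bright = layer_mask & ~(intensity < T_DARK)
    fit = lambda d: (d.mean(), d.var() + 1e-3) if d.size > 0 else (128.0, 1e4)
    return fit(depth[dark].astype(np.float64)), fit(depth[bright].astype(np.float64))

def depth_likelihood(d, i, model):
    """p(D_r | alpha) of (7): pick the dark or bright Gaussian based on intensity i."""
    (chi, nu), (chi_b, nu_b) = model
    mean, var = (chi, nu) if i < T_DARK else (chi_b, nu_b)
    return np.exp(-0.5 * (float(d) - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
```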

3.4 Adaptive Weighting

In previous work such as Kolmogorov et al. (2005), the color and depth cues are treated equally, i.e., λ_c and λ_d are constant over time. However, color and depth may have different discriminative power at different periods. Clearly a robust fusion algorithm should adaptively adjust the importance of different cues over time. For instance, when there are background objects approaching the foreground, the depth cue is ambiguous. In that case, if the foreground/background colors can be well separated, the algorithm should rely more on the color cue. Likewise, if there is a sudden illumination change, the color statistics learned from previous frames are less reliable, so the depth cue should be favored more. Motivated by this observation, we propose to adaptively adjust the weighting factors to improve the robustness of the segmentation process.

Weighting factors λ_c and λ_d are determined based on the discriminative capabilities of the color and depth models. To measure the reliability of the color term, we compute the Kullback-Leibler (KL) distance between the gray-scale histograms of frames I^{t−1} and I^t, together with the KL distance between the color histograms of the separated foreground/background layers in I^{t−1}. We denote the two gray-scale histograms as h^{t−1} and h^t. Each histogram has K bins and the values are normalized so Σ_i h^{t−1}(i) = Σ_i h^t(i) = 1. The KL distance between them is

δ^KL_lum = Σ_{i=1}^{K} h^t(i) log( h^t(i) / h^{t−1}(i) ).   (9)

The KL distance between the foreground and background color histograms as defined in Sect. 3.2 for frame I^{t−1} can be computed in a similar way. We denote their corresponding KL distance to be δ^KL_rgb.

The confidence of the color term is computed using δ^KL_lum and δ^KL_rgb as

Λ_c = exp(−δ^KL_lum / η^c_lum) · (1 − exp(−δ^KL_rgb / η^c_rgb)),   (10)

where η^c_lum and η^c_rgb are the parameters that control the sharpness of the exponential functions. If δ^KL_lum is small, we assume that there are no significant illumination changes or background color variations between images I^{t−1} and I^t, and therefore the color histograms learned from I^{t−1} should be reliable. On the other hand, if δ^KL_rgb is small, it implies that the foreground and background layers have similar color distributions and accurate segmentation via the color cue is difficult to achieve.

The confidence of the depth term is calculated based on the distance between the average depth values of the foreground and background layers in I^{t−1}. The distance can be approximated from the depth likelihood models defined in Sect. 3.3 as χ = |(χ^F + χ'^F) − (χ^B + χ'^B)| / 2 (note that 0 ≤ χ ≤ 255). The confidence of the depth term is defined as

Λ_d = 1 − exp(−χ / η_d),   (11)

where η_d is a constant parameter. The confidence Λ_d is small if the distance between the foreground and background layers is small. The weighting factors are then computed as λ_c = Λ_c / (Λ_c + Λ_d) and λ_d = Λ_d / (Λ_c + Λ_d). Detailed parameter settings for η^c_lum, η^c_rgb and η_d are discussed in Sect. 5.3.
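The adaptive weighting amounts to a handful of scalar operations per frame. The sketch below computes (9)–(11) from normalized histograms and returns (λ_c, λ_d); the confidence symbols and the default η values shown follow our reading of the text and are otherwise illustrative.

```python
import numpy as np

def kl_distance(h_cur, h_prev, eps=1e-8):
    """Discrete KL distance of (9) between two normalized histograms."""
    return float(np.sum(h_cur * np.log((h_cur + eps) / (h_prev + eps))))

def adaptive_weights(h_lum_cur, h_lum_prev, h_rgb_fg, h_rgb_bg, depth_gap,
                     eta_lum=0.1, eta_rgb=2.0, eta_d=50.0):
    """Compute lambda_c and lambda_d from the confidences of (10) and (11).

    depth_gap is chi = |(chi_F + chi'_F) - (chi_B + chi'_B)| / 2, in [0, 255].
    """
    d_lum = kl_distance(h_lum_cur, h_lum_prev)          # frame-to-frame appearance change
    d_rgb = kl_distance(h_rgb_fg, h_rgb_bg)             # fg/bg color separability
    conf_c = np.exp(-d_lum / eta_lum) * (1.0 - np.exp(-d_rgb / eta_rgb))  # (10)
    conf_d = 1.0 - np.exp(-depth_gap / eta_d)                              # (11)
    total = conf_c + conf_d + 1e-8
    return conf_c / total, conf_d / total
```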

4 Matting Via Multichannel Poisson Equations

For the second half of our matting system, we perform matte estimation based on the binary segmentation results from Sect. 3. We improve the well-known Poisson matting by introducing a set of Poisson equations that are derived for handling multichannel color vectors and the additional depth channel from the TOF camera.

4.1 Equation for Single Channel Image Matting

Our approach is inspired by the Poisson matting algorithm (Sun et al. 2004). In Poisson matting, the alpha matte is estimated using a selected color channel k. An approximate gradient of alpha is calculated by taking the gradient on both sides of (1) and omitting the gradients of the foreground and the background by assuming that they are both smooth across the image:

∇I_k = (F_k − B_k)∇α + α∇F_k + (1 − α)∇B_k
⇒ ∇α ≈ ∇I_k / (F_k − B_k).   (12)

The global Poisson equation is then set up by taking the divergence on both sides:

Δα ≈ div( ∇I_k / (F_k − B_k) ).   (13)


4.2 Equation for General Color Image Matting

Unlike the global Poisson matting equation used by Sun et al. (2004), our multichannel Poisson equations are derived using all color channels. This is done by first rearranging (1) into

I − B = α(F − B).   (14)

Taking the gradient on both sides of (14) and applying Leibniz's law gives us

∇ ⊗ (I − B) = (∇α) ⊗ (F − B) + α(∇ ⊗ (F − B)),   (15)

where ∇ ⊗ I represents the tensor product between the gradient operator and the RGB color image I:

∇ ⊗ I = [ ∂I_r/∂x  ∂I_g/∂x  ∂I_b/∂x
          ∂I_r/∂y  ∂I_g/∂y  ∂I_b/∂y ].   (16)

Instead of relying on the smoothness assumption to omit the unknown α term, here we first multiply a column vector (F − B)^T on both sides of (15) and then cancel out the unknown variable α using (14):

(∇ ⊗ (I − B))(F − B)^T = ((∇α) ⊗ (F − B))(F − B)^T + α(∇ ⊗ (F − B))(F − B)^T
                       = ∇α((F − B)(F − B)^T) + (∇ ⊗ (F − B))(I − B)^T.   (17)

The gradient of α can then be represented as

∇α = [(∇ ⊗ (I − B))(F − B)^T − (∇ ⊗ (F − B))(I − B)^T] / [(F − B)(F − B)^T]
   = [(∇ ⊗ I)(F − B)^T − (∇ ⊗ F)(I − B)^T − (∇ ⊗ B)(F − I)^T] / [(F − B)(F − B)^T].   (18)

It is noteworthy that the above equations are derived without any approximation. If both foreground and background colors are known and different (the Smith-Blinn assumption), the gradient of α can be precisely calculated. When they are both unknown, however, we need to assume they are smooth and omit their gradients. This gives us the following multichannel Poisson equation:

∇ ⊗ F ≈ 0,  ∇ ⊗ B ≈ 0
⇒ Δα ≈ div(G) = div( (∇ ⊗ I)(F − B)^T / ((F − B)(F − B)^T) ),   (19)

where G is the approximate gradient of the matte. It is also clear that, if only one color channel is used, the above multichannel Poisson equation is equivalent to (13), given

Δα ≈ div( ∇I_k(F_k − B_k) / (F_k − B_k)² ) = div( ∇I_k / (F_k − B_k) ).   (20)
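As an illustration of (19), the following NumPy sketch forms the approximate matte gradient G and its divergence with simple finite differences. It is a CPU restatement of the math only (the paper's implementation evaluates the same quantities in pixel shaders), and the names are illustrative; all inputs are assumed to be float arrays.

```python
import numpy as np

def grad(channel_img):
    """Forward-difference gradients along x and y for each channel of a float image."""
    gx = np.zeros_like(channel_img)
    gy = np.zeros_like(channel_img)
    gx[:, :-1] = channel_img[:, 1:] - channel_img[:, :-1]
    gy[:-1, :] = channel_img[1:, :] - channel_img[:-1, :]
    return gx, gy

def matte_gradient(I, F, B, eps=1e-6):
    """Approximate grad(alpha) of (19): ((grad I)(F-B)^T) / ((F-B)(F-B)^T)."""
    Ix, Iy = grad(I)                                  # each (H, W, 3)
    FB = F - B
    denom = np.sum(FB * FB, axis=2) + eps             # (F-B)(F-B)^T per pixel
    Gx = np.sum(Ix * FB, axis=2) / denom              # x-row of the tensor product
    Gy = np.sum(Iy * FB, axis=2) / denom              # y-row of the tensor product
    return Gx, Gy

def divergence(Gx, Gy):
    """div(G), the right-hand side of the Poisson equation Δalpha ≈ div(G)."""
    div = np.zeros_like(Gx)
    div[:, 1:] += Gx[:, 1:] - Gx[:, :-1]              # backward difference of Gx
    div[1:, :] += Gy[1:, :] - Gy[:-1, :]              # backward difference of Gy
    return div
```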

4.3 Matting Equation for Known Depth

It has been shown that leveraging the depth information returned from depth sensors helps to improve matting quality for both Bayesian and Poisson matting (Wang et al. 2007b; Zhu et al. 2009). In Wang et al. (2007b), the depth information is integrated into the Bayesian matting equation as an additional color channel. However, for Poisson matting, the depth information is used for validation only, because the original Poisson equation was derived using one color channel (Sun et al. 2004).

Under our multichannel Poisson matting formulation, the depth cue can be integrated naturally into the matting equation to bootstrap the matte estimation process. Same as in Wang et al. (2007b), here we assume depth readings in fuzzy areas follow the same alpha compositing rule as the color channels:

D − D^B = α(D^F − D^B),   (21)

where D^F and D^B are the foreground and background depth values, respectively.

Following the same derivation from (15)–(19) gives us the following Poisson equation:

Δα ≈ div(G_D) = div( [(∇ ⊗ I)(F − B)^T + τ ∇D (D^F − D^B)] / [(F − B)(F − B)^T + τ (D^F − D^B)²] ),   (22)

where G_D is the approximate gradient of alpha with known depth and the parameter τ controls the contribution of the depth information.
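Extending the previous sketch to (22) only changes the numerator and denominator; the snippet below reuses grad() from the sketch after (20) and treats τ as a user-set depth weight. Names remain illustrative.

```python
def matte_gradient_with_depth(I, F, B, D, DF, DB, tau=1.0, eps=1e-6):
    """Approximate grad(alpha) of (22) with the extra depth channel."""
    Ix, Iy = grad(I)
    Dx, Dy = grad(D[..., None])                       # treat depth as a 1-channel image
    FB, dFB = F - B, DF - DB
    denom = np.sum(FB * FB, axis=2) + tau * dFB ** 2 + eps
    Gx = (np.sum(Ix * FB, axis=2) + tau * Dx[..., 0] * dFB) / denom
    Gy = (np.sum(Iy * FB, axis=2) + tau * Dy[..., 0] * dFB) / denom
    return Gx, Gy
```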

5 Implementation

In this section we present implementation details for our proposed matting system.

5.1 Implementation for Bilayer Segmentation

To obtain a model initialization for the bilayer segmentation, we adopt an automatic initialization approach proposed in Yu et al. (2007) by assuming that the to-be-segmented foreground is a person (a valid assumption for many applications). As illustrated in Fig. 2(a), the first step is to apply a face detector to the input video frame. Based on the face detection results, we assume certain pixels to be definite foreground or background. The decision can be made by thresholding pixels' depth values and their spatial distance to the face region in the image. For instance (Fig. 2(c)), the detected face region is assumed to be definite foreground; the expanded region below the detected face (likely to be the shoulder) is regarded as foreground if the corresponding depth values are consistent with the average depth of the face region. On the other hand, pixels to the left, right and above the foreground areas are assumed to be background if their depths differ significantly from the foreground. We found that this approach generally performs well for most videos with the foreground object being a person. Note that when the foreground object is not a person, users can draw a few strokes on the starting frame indicating the foreground and background regions and initialize the system in a semi-automatic manner (Boykov and Jolly 2001; Pham et al. 2009).

Fig. 2 (Color online) Automatic initialization of the binary segmentation in the first frame. (a) Color image and the face detection result from the Open Source Computer Vision (OpenCV) library. (b) Depth image. (c) According to the face region and depth map, certain image regions are assumed to be definite foreground (red) and background (blue). (d) Binary segmentation result

For our earlier implementation reported in Wang et al. (2010), all the computations are carried out by the CPU and the bilayer segmentation is sufficiently fast for online foreground extraction. In order to maintain the system's real-time performance after introducing the alpha matte extraction step, we take advantage of the GPU's massively data-parallel architecture and adopt a co-operative approach for bilayer segmentation. In detail, we use the GPU to compute the per-pixel color/depth likelihoods and the pixel-wise contrast term in massive parallelism, and the CPU to perform graph cuts and online learning, which require more flexible looping and branching capability. Computation times are reported in Sect. 6.3.

Fig. 3 The flowchart of the presented algorithm and the intermediate results for the "dog" data set. (a) color input; (b) trimap; (c) estimated foreground; (d) estimated background; (e) initial alpha matte; (f) approximate Laplacian of alpha; (g) estimated alpha matte; (h) updated trimap

5.2 Implementation for Matting

Once the current frame is segmented into foreground and background regions, matting is applied to handle the fuzzy boundaries of the foreground objects. Figure 3 illustrates the alpha matte extraction process. For real-time performance, all the matting operations are designed and optimized for parallel execution on the GPUs.

5.2.1 Initial Trimap Generation

The matting step starts with generating a trimap from the bilayer segmentation result. This is done by eroding the foreground and background regions and marking the areas in between as unknown. In our GPU implementation, the erosion is performed through applying a pixel shader in multiple rendering passes. In each pass, the shader marks a pixel p as unknown if any of p's neighbors has the unknown label or has a label differing from p. According to the nature of the foreground object, the user can control the amount of erosion to be applied. After the trimap is generated, the remaining operations need to be applied only to the unknown region. To improve processing speed, the early Z-kill feature is enabled to limit the computations to the unknown region of the trimap.
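For reference, the same erosion can be written with standard binary morphology on the CPU; the sketch below (using SciPy's binary_erosion, with an illustrative erosion radius) mirrors what the multi-pass pixel shader computes.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def make_trimap(segmentation, radius=4):
    """Build a trimap from a binary segmentation (1 = foreground, 0 = background).

    Both layers are eroded by `radius` passes of the default structuring element;
    whatever is no longer definitely foreground or background becomes unknown.
    Returns 255 for foreground, 0 for background, 128 for unknown.
    """
    fg = binary_erosion(segmentation == 1, iterations=radius)
    bg = binary_erosion(segmentation == 0, iterations=radius)
    trimap = np.full(segmentation.shape, 128, dtype=np.uint8)
    trimap[fg] = 255
    trimap[bg] = 0
    return trimap
```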

5.2.2 Estimate Unknown Foreground and Background

Using the source image and the generated initial trimap as inputs, the next step estimates the colors of both foreground and background in the unknown region. Based on the smoothness assumption, an unknown pixel's foreground (background) color can be approximated using the color of the nearest pixel in the region (Sun et al. 2004). Following this idea, an image morphology based procedure is utilized. The procedure fills an unknown foreground (background) pixel with the average color of its neighbors if at least one of its four neighbors has a known foreground (background) color. Once a pixel is processed, the depth buffer used for early Z-kill is updated accordingly so that the pixel will not be processed again.
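A CPU sketch of this fill procedure: unknown pixels are filled ring by ring with the average color of their already-known 4-neighbors, mirroring the repeated shader passes with early Z-kill. The names and pass limit are illustrative.

```python
import numpy as np

def fill_layer_colors(image, known_mask, max_passes=64):
    """Propagate known layer colors (foreground or background) into the unknown region.

    image: (H, W, 3) float colors, valid where known_mask is True.
    Returns the filled color image and the final known mask.
    """
    colors = np.where(known_mask[..., None], image, 0.0)
    known = known_mask.copy()
    for _ in range(max_passes):
        # Sum of known 4-neighbors and how many of them are known, per pixel.
        # (np.roll wraps around at the image border; acceptable for a sketch.)
        acc = np.zeros_like(colors)
        cnt = np.zeros(known.shape, dtype=np.float64)
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            acc += np.roll(colors, (dy, dx), axis=(0, 1)) * np.roll(known, (dy, dx), axis=(0, 1))[..., None]
            cnt += np.roll(known, (dy, dx), axis=(0, 1))
        newly = (~known) & (cnt > 0)
        if not newly.any():
            break
        colors[newly] = acc[newly] / cnt[newly][:, None]
        known |= newly
    return colors, known
```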

5.2.3 Alpha Matte Initialization

The estimated foreground and background colors, together with the source image, are used as inputs for calculating the approximate Laplacian of alpha. Depending on whether the depth information is available or not, one of (19) and (22) is used.

Different techniques can be applied to solve the Poisson equation and recover the alpha matte. To facilitate parallel processing on GPUs, the Jacobi method, which iteratively updates the alpha value of the center pixel based on the neighboring alpha values to locally satisfy the Poisson equation, is used. The Jacobi method requires an initial solution to start with. In practice, we found that the quality of the initial solution has a great impact on the convergence speed. In addition, when running the Jacobi method under limited precision, e.g., when the alpha values are represented using integers within [0, 255] on the GPU, the final solution may not converge to the global optimum, and therefore a better initial solution also helps to achieve a more accurate result.

To generate a good initial solution, we compute the initial alpha matte directly in the color space, before using the Poisson equation to solve the matte in the gradient space. The equation for alpha initialization is derived by applying a dot product with (I − B) on both sides of (14):

(I − B) · (I − B) = α(F − B) · (I − B)
⇒ α = [(I − B) · (I − B)] / [(F − B) · (I − B)],   (23)

where F and B are the estimated foreground and background colors in the unknown region, respectively. Similar equations can be derived by computing dot products with other color difference vectors such as (F − B) or (F − I). However, we found that (I − B) in general works better. This is because I is known and B, if unknown, can be more accurately estimated than F can be.
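Putting the two pieces together, the sketch below initializes the matte with (23) and then runs Jacobi iterations that locally satisfy Δα ≈ div(G) inside the unknown region, using a divergence map such as the one computed in the Sect. 4.2 sketch. The iteration count follows Sect. 5.3; the function names are illustrative.

```python
import numpy as np

def init_alpha(I, F, B, eps=1e-6):
    """Color-space initialization of (23): alpha = (I-B)·(I-B) / ((F-B)·(I-B))."""
    IB, FB = I - B, F - B
    alpha = np.sum(IB * IB, axis=2) / (np.sum(FB * IB, axis=2) + eps)
    return np.clip(alpha, 0.0, 1.0)

def jacobi_poisson(alpha0, div_g, unknown, iterations=64):
    """Jacobi relaxation of Δalpha ≈ div(G), restricted to the unknown region."""
    alpha = alpha0.copy()
    for _ in range(iterations):
        # Average of the four neighbors (5-point Laplacian stencil).
        nb = (np.roll(alpha, 1, axis=0) + np.roll(alpha, -1, axis=0) +
              np.roll(alpha, 1, axis=1) + np.roll(alpha, -1, axis=1))
        updated = (nb - div_g) / 4.0
        alpha = np.where(unknown, np.clip(updated, 0.0, 1.0), alpha)
    return alpha
```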

5.2.4 Iterative Trimap Refinement

As shown in Fig. 3(e), although the initial alpha matte is fairly accurate when both foreground and background contain high frequency textures, a close inspection shows that, due to imprecise foreground/background color estimation, artifacts do exist in areas such as the one highlighted with a red rectangle.

Inspired by Sun et al. (2004), the initial alpha matte can be used to generate a new trimap, which helps to obtain better foreground and background estimation and in turn better alpha mattes. In our implementation the new trimap is computed through multi-level thresholding on the estimated alpha matte. In detail, a pixel is labeled as foreground if its estimated alpha value is higher than T_high and background if lower than T_low. The updated trimap is used to repeat the matte extraction process.
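The refinement itself is a two-threshold relabeling of the estimated matte, sketched below with the T_high and T_low values from Sect. 5.3.

```python
import numpy as np

def refine_trimap(alpha, t_low=0.05, t_high=0.95):
    """Relabel pixels from the estimated matte: fg if alpha > t_high, bg if alpha < t_low."""
    trimap = np.full(alpha.shape, 128, dtype=np.uint8)   # unknown
    trimap[alpha > t_high] = 255                         # foreground
    trimap[alpha < t_low] = 0                            # background
    return trimap
```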

5.3 Parameter Settings

In this section, we provide the parameter settings used in our implementation. The number of bins H defined in Sect. 3.2 for the RGB color histogram is set to 8³. We experimentally found that this setting provides a good tradeoff between accuracy and efficiency. Dividing the RGB color space more precisely (say 16³ bins) could produce better segmentation at the expense of longer computation time and the risk of over-fitting. Regarding the threshold value T_dark for classifying "bright" or "dark" pixels defined in Sect. 3.3, we experimentally found that the overall performance is robust when this parameter is within the range [50, 70] (pixel intensities are between 0 and 255). In our experiments we set T_dark = 60 throughout.


Fig. 4 Data sets for evaluating bilayer segmentation. The first row shows sample images; their corresponding depth maps returned by the TOF camera are shown in the second row

For adaptive weighting (Sect. 3.4), K is the number of bins for the gray-scale histogram, and η^c_lum, η^c_rgb and η_d control the shapes of the three exponential functions defined in (10) and (11), respectively. According to our experiments, η^c_lum is the least sensitive parameter among the three. This is because the KL distance δ^KL_lum between the two gray-scale histograms h^{t−1} and h^t is in general very small (close to zero) unless there is a sudden illumination change from time t − 1 to t. In our experiments we set η^c_lum to 0.1 throughout. Compared to η^c_lum, η^c_rgb and η_d are slightly more sensitive and require more tuning for different sequences. In our implementation η^c_rgb ranges from 1.2 to 2.5 and η_d ranges from 45 to 60. We experimentally found that such parameter settings typically work quite well.

For matting, the first key parameter is the number of iterations for the Jacobi method. Although the Jacobi method has the reputation of slow convergence, we experimentally found that it takes about 50 iterations to converge, thanks to the accurate initial solution obtained. To err on the safe side, the number of Jacobi iterations is set to 64 throughout our experiments. The two threshold values T_high and T_low defined in the iterative trimap refinement step are 0.95 and 0.05, respectively. The final parameter is the number of iterations for the trimap refinement. Ideally the process should iterate until it converges, i.e., the updated trimap is the same as the previous version. However, for speed performance a fixed number of iterations is performed. Through experiments, we found that two iterations are sufficient in most scenarios.

6 Experimental Results

The TOF camera used in our experiments is the ZCam from 3DV Systems, as shown in Fig. 1. The ZCam can produce synchronized 320 × 240 RGB color video and depth maps of the same resolution (256 depth levels per pixel) at 30 frames per second (fps). The color images and depth maps are internally aligned by the software. In order to evaluate the effectiveness of our proposed matting system, we have conducted experiments on several challenging video sequences. In this section, we first quantitatively evaluate the performance of each step, video binary segmentation and matting, respectively. We then report the speed performance of their combination as an automatic real-time matting system. We also encourage readers to view the videos in our supplemental materials to verify the effectiveness of our algorithm.

6.1 Evaluation of Bilayer Segmentation

We have captured several videos with an additional depth channel available using the ZCam. Currently, four sequences have their ground truth binary segmentation labeled manually, which allows quantitative evaluation to be conducted. Sample images of the four video sequences used in this work are shown in Fig. 4. The first sequence, WL, contains rich foreground motion, and the foreground and background color distributions are similar. Also, the depth measurements of the dark hair region suffer from the intensity bias mentioned earlier. In the second sequence, MS, we demonstrate the case of a moving camera. Note that both the background scene and the global illumination vary over time (although not significantly) during the camera's motion. A moving camera also produces some amount of camera shaking, which is not easy to handle in previous work. In the third sequence, MC, lights in the room were switched on and off to simulate global lighting variation and the background contains dynamic moving objects. The last one, the CW sequence, is particularly challenging for segmentation algorithms using the depth cue because there is a person passing by the foreground layer and their relative distance is small according to the depth measurements.

Ground truth binary segmentation results are manually labeled on every fourth frame. Each pixel is labeled as background, foreground or unknown. The unknown band is two to three pixels in width and covers the mixed pixels along layer boundaries. Following Kolmogorov et al. (2005), error rates are measured as the percentage of misclassified pixels w.r.t. the ground truth data. In our experiments, for each test sequence we quantitatively evaluate the segmentation accuracy of four different methods. Besides our segmentation algorithm that uses the adaptive weight fusion scheme, we also assess three different variants based on the MRF formulation in Sect. 3.1. First, segmentation algorithms that rely on either the color or the depth cue only are tested by setting λ_d or λ_c to zero, respectively. In order to validate the effectiveness of the adaptive weight fusion, we also compare it against traditional constant weight fusion. For constant weight fusion, we assume the color and depth terms have equal contribution and set λ_c and λ_d to 0.5 throughout. This is also the initial parameter setup for adaptive weight fusion.

Fig. 5 Evaluation of segmentation accuracy using ground truth data. Percentages of misclassified pixels in the known region are computed on all four sequences, every fourth frame. Experimental results show that fusing color and depth cues in general outperforms using color or depth information alone. Note that the quantitative evaluation also confirms that our adaptive weight fusion method performs more robustly than constant weight fusion on challenging sequences like MC and CW

Table 1 The mean segmentation error w.r.t. the known image region (known) and the whole image space (all) for different methods and test sequences. Again, this table demonstrates that adaptive weight fusion achieves better accuracy than the other three methods

In Fig. 5 we plot the error curves of the four methods w.r.t. our test data. The percentage of misclassified pixels within the known region is computed on all sequences, every fourth frame. In Table 1, we further provide the mean segmentation error for each method. Note that the error statistics on both the known region and the whole image area (unknown pixels are included when counting the total number of pixels) are shown in this table.

Fig. 6 Sample frames demonstrating the error propagation of constant weight fusion. In comparison, adaptive weight fusion can avoid the drifting issue for this scenario. Full segmentation results can be found in our supplemental materials

The quantitative evaluation confirms that combining depth and color in general achieves better accuracy than using either color or depth alone. Furthermore, by determining λ_c and λ_d based on the discriminative capabilities of the color and depth cues, adaptive fusion outperforms constant weight fusion on most sequences, especially on the challenging sequence CW.

By further investigating Fig. 5 we can find that the color information alone is ambiguous for bilayer segmentation. The depth information from the TOF camera seems to be a robust cue and demonstrates good performance on the first three sequences. The constant weight fusion improves on the depth-only segmentation in general; however, it is worth noticing that it performs poorly for the CW sequence. Why can fusing multiple cues sometimes lead to worse results? We now look into those problematic frames to find the answer. As shown in Figs. 5 and 6, near the 100th frame the background object starts being incorrectly labeled as foreground in the depth-based method because of the analogous depth distributions of the two layers. But for the fusion approach, given that the color term remains reliable at that moment, the separation is correct. Near the 120th frame the depth ambiguity causes the background object to be misclassified as foreground for constant weight fusion. Around the 160th frame, although the background person is no longer within the camera's field of view, the error propagation (the color cue learned from the early incorrect segmentation) causes the drifting artifacts. As a result, the constant weight fusion fails from frame 160 to frame 300. In comparison, by plotting the weighting factors as a function of time in Fig. 7, we can see that our method intelligently adjusts the importance of the two terms over time. When the background object is approaching the foreground layer, the influence of the depth term is decreased accordingly.

Fig. 7 (Color online) Plots of the color and depth weights as a function of time for the CW sequence. As can be seen, the relative importance of the color term is increased when the background object is moving close to the foreground object. (This figure is best viewed in color)

We also compare our algorithm against a commercial live foreground segmentation routine released by 3DV Systems. The software requires that the user provide a dividing plane (as an input parameter) that defines which objects lie in the foreground. This manual input must be given once per camera setup when used in a controlled environment. The software then computes a depth threshold on the given dividing plane to create a binary segmentation.

Because the software does not support offline processing, we are not able to test their approach using the data sets with ground truth labels. We instead let the commercial software perform segmentation on a scene similar to the one we set up for CW. Note that the corresponding parameters of the software are tuned so that it works at its best at the beginning of the video, i.e., the foreground layer is accurately segmented from the scene. In Fig. 8 we show a screen shot of the segmentation result from the commercial software. As can be seen, when the background is approaching the foreground, the software incorrectly treats the background object as foreground. In the second row we provide our result and the corresponding video frame. Our approach is able to achieve correct segmentation by relying more on color information. For a side by side comparison please refer to our supplemental materials.

Fig. 8 The first row: screen shot of live foreground extraction from a commercial software package released by 3DV Systems. Similar to the constant weight fusion, the background is incorrectly estimated as foreground when the distance between the two layers is small. The second row: a similar scene and our segmentation result. Note the lower left image is flipped horizontally to be consistent with the software's live output

6.2 Evaluation of Matting

In this experiment, we first evaluate the matting quality achieved using the multichannel Poisson equations. For quantitative evaluation, we employ the data sets with ground truth alpha mattes presented by Wang and Cohen (2007a). Each data set contains ten trimaps of different levels of accuracy, with T0 being the most accurate trimap and T9 the least accurate one. Example results obtained using the second trimap (not necessarily the best choice) are shown in Fig. 9. An alpha matte is generated using each of the ten trimaps and its accuracy is evaluated using the mean square error (MSE). The lowest MSE value among the ten results (Emin) and the difference between the highest and the lowest MSE values (Ediff) are shown in Table 2. The latter measurement gives a good indication of the robustness of a given algorithm (Wang and Cohen 2007a).

Fig. 9 Results of our algorithm on the "dog", "hair", "camera", "bird", and "child" data sets. From top to bottom: source images, input trimaps, ground truth alpha mattes from Wang and Cohen (2007a), estimated alpha mattes using the proposed method, and composite results

Table 2 Quantitative comparison with existing algorithms (measurements for existing approaches are reported by Wang and Cohen (2007a)). Emin: the minimum MSE value obtained using 10 different trimaps; Ediff: the difference between the maximum and minimum MSE values

As shown in Table 2, when compared to the global Poisson matting, our multichannel Poisson algorithm reduces the Emin value by 65∼90% and the Ediff value by 70∼90%. This suggests that our approach is not only more accurate than the original Poisson matting, but also more tolerant to imprecise trimaps, which is an important property for video matting since automatically generated trimaps are generally not as accurate as manually labeled ones. It is also noteworthy that this performance gain is achieved without using scene depth information. As will be demonstrated later, matting results in general can be further improved if the depth cue is leveraged. The comparison also suggests that our approach is comparable to other state-of-the-art matting approaches. It ranks on average 3.8 out of 8 on the Emin measure and 3.2 out of 8 on the Ediff measure. Considering that our algorithm is designed for handling video sequences in real-time whereas others require seconds to minutes on a single image, this result is very encouraging.

The performance gain over the original Poisson matting algorithm can be attributed to both the multichannel Poisson equation and the color-space matte initialization technique described in Sect. 5.2. We further investigate the effect of each technique by enabling them one at a time. The advantage of the multichannel Poisson equation over the standard single channel Poisson equation is demonstrated in Fig. 10. The comparison is conducted on all five data sets and the second trimap is used throughout. The results for the single channel Poisson equation are obtained based on (13) using one of the RGB channels or the combined luminance channel. The results show that, for all five data sets, the multichannel Poisson equation produces more accurate alpha mattes than applying the single channel equation on any of the four selected channels. On average, the multichannel equation helps to reduce the MSE by about 50%.

Fig. 10 Comparison between the matting results obtained using single-channel and multichannel Poisson equations

Fig. 11 Comparison between the matting results obtained with and without using the color-space matte initialization

Figure 11 compares the alpha mattes obtained with and without enabling color-space matte initialization. The comparison is performed on the “camera” and “child” data sets under different input trimaps. When color-space matte initialization is disabled, the initial alpha values for all unknown pixels are set to 0.5. The results suggest that the color-space matte initialization step does not offer much improvement over the conventional approach when the most accurate trimap (T0) is used, but becomes very helpful when the input trimap is inaccurate. This is because, when the trimap is precisely defined and the unknown region is small, the gradient of alpha can be estimated accurately, so the choice of initial values has little effect on the numerical solution of the Poisson equation. However, as the unknown region gets larger and the gradient information becomes unreliable, the numerical solution computed with limited precision depends much more on good initial values.
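The exact color-space initialization rule is given in Sect. 5.2; the hypothetical variant below merely illustrates the idea of replacing the flat 0.5 start with a data-driven guess, here the projection of each pixel's color onto the line between estimated foreground and background colors. The function name, the F_est/B_est inputs, and the trimap label values (0/128/255) are all assumptions for illustration.

    import numpy as np

    def init_alpha(I, F_est, B_est, trimap):
        # Initialize the matte from color instead of a constant 0.5 in the
        # unknown region (a hypothetical stand-in for color-space initialization).
        d = F_est - B_est                                      # H x W x C
        t = np.sum((I - B_est) * d, axis=2) / (np.sum(d * d, axis=2) + 1e-6)
        alpha = np.clip(t, 0.0, 1.0)
        alpha[trimap == 255] = 1.0                             # definite foreground
        alpha[trimap == 0] = 0.0                               # definite background
        return alpha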

Figure 12 shows the results generated before and after the matting step for the four video sequences shown in Fig. 4. Since the boundaries of the foreground objects in these scenes are relatively sharp, the effect of matting is quite subtle. Yet, careful inspection shows that matting helps to remove the aliasing artifacts along the object boundaries.

Fig. 12 Results of automatic video matting for the four data sets shown in Fig. 4. Top row: bilayer segmentation result; bottom row: matting result. Because the boundaries of the foreground objects in these scenes are relatively sharp, the improvement from matting is not obvious. However, careful inspection still shows that matting helps to remove the aliasing artifacts along layer boundaries. Readers are encouraged to zoom in to the images to see the improvement introduced by applying matting

Fig. 13 Matting results for the WL2 data set: (a) color input, (b) depth captured, (c) bilayer segmentation result, (d) automatically generated trimap, (e) alpha matte obtained using the original Poisson equation with the luminance channel, (f) alpha matte generated using the multichannel Poisson equation with color information only, (g–h) results obtained using the multichannel Poisson equation with both color and depth information. Full matting results can be found in our supplemental materials

To further demonstrate the effects of video matting, two more data sets are captured using objects with fuzzy boundaries. Figure 13 shows the results obtained for the first of the two. They suggest that, when the object has fuzzy boundaries, it is difficult for the bilayer segmentation to precisely cut out the foreground. Nevertheless, the segmentation is good enough for automatic trimap generation. The comparison among the alpha mattes shown in the figure reveals that: (1) the matte extracted using the original Poisson equation is quite blurry, which is caused by the imprecise Laplacian of the matte obtained using luminance information only; (2) using the multichannel matting equation helps to obtain a more accurate alpha matte, but sharp edges in the background also show up in the matte; (3) utilizing the additional depth information helps to correct the artifacts caused by color discontinuities in the background.

Figure 14 shows the results obtained for the second data set. They also suggest that using the additional depth information helps to remove artifacts caused by the non-smooth background. Full matting results of these two video sequences can be found in our supplemental materials.

Fig. 14 Matting results for the toy data set: (a) color input, (b) depth captured, (c) bilayer segmentation result, (d) automatically generated trimap, (e–f) results obtained using the multichannel Poisson equation but without the depth information, (g–h) results obtained using the additional depth information. Full matting results can be found in our supplemental materials

6.3 Speed Performance

In terms of processing speed, the algorithm is tested on a Lenovo S10 workstation with an Intel 3 GHz Core 2 Duo CPU and an NVIDIA Quadro FX 1700 GPU. We evaluate the most computationally critical steps of the proposed pipeline; the corresponding computation time measurements are summarized in Table 3. The joint CPU/GPU bilayer segmentation implementation achieves 32 fps and the matting algorithm runs on the GPU at about 80 fps for video sequences of resolution 320 × 240. The whole matting pipeline runs at 23 fps, with over 70% of the computation time spent on the bilayer segmentation. Furthermore, the GPU-based implementation speeds up the computation by more than a factor of three.

Table 3 Running times (in milliseconds) of major computational components of the system. The image resolution is 320 × 240. Please note the performance gain from the GPU implementation
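Per-stage timings such as those in Table 3 can be gathered with a small wrapper around each pipeline stage. The sketch below is generic instrumentation code; the stage names and call sites in the usage comment are placeholders, not part of the described system.

    import time
    from collections import defaultdict

    class StageTimer:
        # Accumulates wall-clock time (ms) per pipeline stage over many frames.
        def __init__(self):
            self.totals = defaultdict(float)
            self.frames = 0

        def run(self, name, fn, *args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            self.totals[name] += (time.perf_counter() - start) * 1000.0
            return result

        def average_ms(self):
            return {name: total / max(self.frames, 1)
                    for name, total in self.totals.items()}

    # Hypothetical per-frame usage:
    #   timer.run("bilayer_segmentation", segment, color, depth)
    #   timer.run("matting", solve_matte, trimap, color, depth)
    #   timer.frames += 1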

7 Discussion

7.1 Segmentation Leveraged by Active/Passive Depth Sensing

Automatic separation of layers from color/contrast alone is known to be error-prone and beyond the capability of fully automatic algorithms. The success of our system largely lies in the reliable depth information returned by the TOF camera. Fusing color and depth cues for foreground extraction has also been explored in the past, and stereo matching is the most popular way to recover scene depth. While stereo-based foreground extraction systems can achieve high-quality segmentation (Kolmogorov et al. 2005; Zhang et al. 2011), their practicality is limited by the need to perform dense matching. Although the state of the art in stereo is advancing rapidly, the problem is inherently ill-posed and there are fundamental difficulties that make a high-quality, real-time and general solution challenging. The major obstacle to producing consistent, accurate depth estimates with stereo is that the foreground object or background scene can be largely textureless; man-made scenes such as office interiors (e.g., the MS and CW sequences) frequently lack sufficient texture.

In addition to the traditional two-camera configuration used in Kolmogorov et al. (2005), Zhang et al. (2011) use a single hand-held camera and motion parallax to recover depth for foreground extraction. The use of multiple images (more than 20, as reported in Zhang et al. (2008)) and segmentation-based stereo improves depth accuracy and alleviates the problems caused by large textureless regions. However, there are also limitations when state-of-the-art multi-view stereo methods are exploited. First, efficiency is a major issue: top stereo algorithms typically require global optimization, color segmentation and temporal fusion (Zhang et al. 2008), and are hence computationally expensive and only suitable for offline video processing. Second, the single-camera setup in Zhang et al. (2011) assumes a dynamic camera and foreground but a static background in order to satisfy the photo-consistency constraint. Pixels whose appearance changes significantly due to illumination variation, reflection, shadow, or a moving background tend to be classified as foreground because stereo fails there.

7.2 Limitations

The current system still has the following limitations. First, as can be seen from our supplemental materials, the most unpleasant visual problem is the “flickering” artifact along layer boundaries. This is because bilayer segmentation is solved for each frame independently, without enforcing a temporal consistency constraint across frames. An interesting extension is to correlate pixels along the temporal axis using optical flow estimated from the color images and to incorporate a temporal smoothness constraint into the MRF formulation.

Fig. 15 Failure examples of our system. Segmentation errors from previous frames are propagated to later frames

Second, from a system point of view, drifting is always a problem for real-time segmentation that requires online learning. Although our adaptive weight method is able to reduce the chance of failure by better balancing the color and depth cues, there is no mechanism to recover the system from error accumulation and propagation. Two failure examples caused by error propagation are shown in Fig. 15. We plan to further investigate this issue and provide solutions to protect against error propagation.

Another limitation of our algorithm is that the current matting step requires the user to specify the width of the unknown region obtained through erosion. When the foreground object has mainly sharp boundaries (Fig. 12), the width should be small for better efficiency, whereas when there are fuzzy boundaries (Figs. 13 and 14), the width needs to be large for effective, high-quality matte extraction. Algorithms have been proposed to automatically and adaptively determine the unknown region based on the binary segmentation result (Rhemann et al. 2008); however, how to incorporate such a technique into our real-time system, and how to utilize the depth information in doing so, warrant further investigation. A minimal sketch of the width-controlled trimap generation described here is given below.
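The sketch uses standard morphology, e.g. via OpenCV; the band_width parameter plays the role of the user-specified width discussed above, and the function name and trimap label values are assumptions.

    import cv2
    import numpy as np

    def make_trimap(fg_mask, band_width):
        # Turn a binary segmentation into a trimap by eroding/dilating the
        # foreground mask; pixels between the eroded and dilated masks form the
        # unknown band whose width is controlled by band_width (in pixels).
        kernel = np.ones((3, 3), np.uint8)
        mask = (fg_mask > 0).astype(np.uint8)
        sure_fg = cv2.erode(mask, kernel, iterations=band_width)
        maybe_fg = cv2.dilate(mask, kernel, iterations=band_width)
        trimap = np.full(mask.shape, 128, dtype=np.uint8)  # unknown band
        trimap[maybe_fg == 0] = 0                          # definite background
        trimap[sure_fg == 1] = 255                         # definite foreground
        return trimap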

8 Conclusion

In this paper, we present a complete video matting system that processes video sequences in real-time and online, without the need for user interaction. By utilizing a TOF camera, appearance and depth cues are intelligently fused to achieve robust bilayer segmentation, the result of which guides the automatic trimap generation. For alpha matte estimation, a novel real-time matting algorithm is presented. The quantitative evaluation shows that the matting results produced by our algorithm are more accurate than those of the original Poisson matting approach. Moreover, thanks to the multichannel Poisson equations, scene depth information acquired from the TOF camera can be leveraged to further improve the matting quality. A thorough comparative evaluation using ground-truth data is presented to assess the performance of our proposed algorithms and gauge the progress of TOF camera-based real-time video segmentation.

Acknowledgements The authors would like to thank Mr. Mao Ye, Dr. Matt Steel and Dr. Melody Carswell for their help in data capture. This work is supported in part by the University of Kentucky Research Foundation and US National Science Foundation awards IIS-0448185, CPA-0811647, and MRI-0923131. Finally, the authors would like to thank the anonymous reviewers for their constructive comments and suggestions.

References

3DV Systems. http://www.3dvsystems.com.
Bai, X., & Sapiro, G. (2007). A geodesic framework for fast interactive image and video segmentation and matting. In Proc. of ICCV.
Blake, A., Rother, C., Brown, M., Perez, P., & Torr, P. (2004). Interactive image segmentation using an adaptive GMMRF model. In Proc. of ECCV.
Boykov, Y., & Jolly, M.-P. (2001). Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In Proc. of ICCV.
Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE TPAMI, 23(11), 1222–1239.
Canesta Inc. http://www.canesta.com/.
Chuang, Y.-Y., Curless, B., Salesin, D., & Szeliski, R. (2001). A Bayesian approach to digital matting. In Proc. of CVPR (pp. 264–271).
Chuang, Y.-Y., Agarwala, A., Curless, B., Salesin, D. H., & Szeliski, R. (2002). Video matting of complex scenes. Proceedings of the SIGGRAPH, 21(3), 243–248.
Crabb, R., Tracey, C., Puranik, A., & Davis, J. (2008). Real-time foreground segmentation via range and color imaging. In Proc. of IEEE workshop on time of flight camera based computer vision.
Criminisi, A., Cross, G., Blake, A., & Kolmogorov, V. (2006). Bilayer segmentation of live video. In Proc. of CVPR.
Davis, J., & Gonzalez-Banos, H. (2003). Enhanced shape recovery with shuttered pulses of light. In Proc. of IEEE workshop on projector-camera systems.
Gastal, E. S. L., & Oliveira, M. M. (2010). Shared sampling for real-time alpha matting. In Proc. of Eurographics.
Gong, M., & Yang, Y.-H. (2009). Near-real-time image matting with known background. In Proc. of Canadian conference on computer and robot vision.
Gong, M., Wang, L., Yang, R., & Yang, Y.-H. (2010). Real-time video matting using multichannel Poisson equations. In Proc. of graphics interface.
Gordon, G., Darrell, T., Harville, M., & Woodfill, J. (1999). Background estimation and removal based on range and color. In Proc. of CVPR.
Grady, L., Schiwietz, T., Aharon, S., & Westermann, R. (2005). Random walks for interactive alpha-matting. In Proc. of VIIP (pp. 423–429).
Harville, M., Gordon, G., & Woodfill, J. (2001). Foreground segmentation using adaptive mixture models in color and depth. In Proc. of IEEE workshop on detection and recognition of events in video.
Joshi, N., Matusik, W., & Avidan, S. (2006). Natural video matting using camera arrays. In Proc. of SIGGRAPH (pp. 779–786).
Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., & Rother, C. (2005). Bilayer segmentation of binocular stereo video. In Proc. of CVPR.
Levin, A., Lischinski, D., & Weiss, Y. (2008). A closed form solution to natural image matting. IEEE TPAMI, 30(2), 228–242.
Li, Y., Sun, J., & Shum, H.-Y. (2005). Video object cut and paste. Proceedings of the SIGGRAPH, 24(3), 595–600.
McGuire, M., Matusik, W., Pfister, H., Hughes, J. F., & Durand, F. (2005). Defocus video matting. In Proc. of SIGGRAPH (pp. 567–576).
McGuire, M., Matusik, W., & Yerazunis, W. (2006). Practical, real-time studio matting using dual imagers. In Proc. of Eurographics symposium on rendering.
MESA Imaging AG. http://www.mesa-imaging.ch/.
Mishima, Y. (1993). Soft edge chroma-key generation based upon hexoctahedral color space. US Patent 5,355,174.
Open Source Computer Vision (OpenCV) Library. http://opencv.willowgarage.com/wiki/.
Pham, V.-Q., Takahashi, K., & Naemura, T. (2009). Real-time video matting based on bilayer segmentation. In Proc. of ACCV.
Porter, T., & Duff, T. (1984). Compositing digital images. In Proc. of SIGGRAPH (pp. 673–678).
Rhemann, C., Rother, C., Rav-Acha, A., & Sharp, T. (2008). High resolution matting via interactive trimap segmentation. In Proc. of CVPR.
Rother, C., Kolmogorov, V., & Blake, A. (2004). GrabCut: interactive foreground extraction using iterated graph cuts. Proceedings of the SIGGRAPH, 23(3), 309–314.
Sun, J., Jia, J., Tang, C.-K., & Shum, H.-Y. (2004). Poisson matting. In Proc. of SIGGRAPH (pp. 315–321).
Sun, J., Zhang, W., Tang, X., & Shum, H.-Y. (2006). Background cut. In Proc. of ECCV (pp. 628–641).
Sun, J., Sun, J., Kang, S.-B., Xu, Z.-B., Tang, X., & Shum, H.-Y. (2007). Flash cut: foreground extraction with flash and no-flash image pairs. In Proc. of CVPR.
Wang, J., & Cohen, M. (2005). An iterative optimization approach for unified image segmentation and matting. In Proc. of ICCV (pp. 936–943).
Wang, J., & Cohen, M. (2007a). Optimized color sampling for robust matting. In Proc. of CVPR.
Wang, J., & Cohen, M. (2007b). Image and video matting: a survey. FTCGV, 3(2), 97–175.
Wang, J., Bhat, P., Colburn, R. A., Agrawala, M., & Cohen, M. F. (2005). Interactive video cutout. In Proc. of SIGGRAPH (pp. 585–594).
Wang, J., Agrawala, M., & Cohen, M. (2007a). Soft scissors: an interactive tool for realtime high quality matting. In Proc. of SIGGRAPH.
Wang, O., Finger, J., Yang, Q., Davis, J., & Yang, R. (2007b). Automatic natural video matting with depth. In Proc. of Pacific graphics.
Wang, L., Zhang, C., Yang, R., & Zhang, C. (2010). TofCut: towards robust real-time foreground extraction using a time-of-flight camera. In Proc. of 3DPVT.
Wu, Q., Boulanger, P., & Bischof, W. F. (2008). Robust real-time bilayer video segmentation using infrared video. In Proc. of Canadian conference on computer and robot vision.
Yang, Q., Yang, R., Davis, J., & Nister, D. (2007). Spatial-depth super resolution for range images. In Proc. of CVPR.
Yin, P., Criminisi, A., Winn, J., & Essa, I. (2007). Tree-based classifiers for bilayer video segmentation. In Proc. of CVPR.
Yu, T., Zhang, C., Cohen, M., Rui, Y., & Wu, Y. (2007). Monocular video foreground/background segmentation by tracking spatial-color Gaussian mixture models. In Proc. of IEEE workshop on motion and video computing.
Zhang, G., Jia, J., Wong, T.-T., & Bao, H. (2008). Recovering consistent video depth maps via bundle optimization. In Proc. of CVPR.
Zhang, G., Jia, J., Hua, W., & Bao, H. (2011). Robust bilayer segmentation and motion/depth estimation with a handheld camera. IEEE TPAMI, 33(3), 603–617.
Zhu, J., Wang, L., Yang, R., & Davis, J. (2008). Fusion of time-of-flight depth and stereo for high accuracy depth maps. In Proc. of CVPR.
Zhu, J., Liao, M., Yang, R., & Pan, Z. (2009). Joint depth and alpha matte optimization via fusion of stereo and time-of-flight sensor. In Proc. of CVPR.