Obstruction Free Photography - SIGGRAPH2015

A Computational Approach for Obstruction-Free Photography

Tianfan Xue 1*   Michael Rubinstein 2   Ce Liu 2   William T. Freeman 1,2
1 MIT CSAIL   2 Google Research

* Part of this work was done while the author was an intern at Microsoft Research New England. Part of this work was done while the authors were at Microsoft Research.

Figure 1: In this paper we present an algorithm for taking pictures through reflective or occluding elements such as windows and fences. The input to our algorithm is a set of images taken by the user while slightly scanning the scene with a camera/phone (a), and the output is two images: a clean image of the (desired) background scene, and an image of the reflected or occluding content (b). Our algorithm is fully automatic, can run on mobile devices, and allows taking pictures through common visual obstacles, producing images as if they were not there. The full image sequences and a closer comparison between the input images and our results are available in the supplementary material.

Abstract

We present a unified computational approach for taking photos through reflecting or occluding elements such as windows and fences. Rather than capturing a single image, we instruct the user to take a short image sequence while slightly moving the camera. Differences that often exist in the relative position of the background and the obstructing elements from the camera allow us to separate them based on their motions, and to recover the desired background scene as if the visual obstructions were not there. We show results on controlled experiments and many real and practical scenarios, including shooting through reflections, fences, and raindrop-covered windows.

CR Categories: I.3.7 [Image Processing and Computer Vision]: Digitization and Image Capture—Reflectance; I.4.3 [Image Processing and Computer Vision]: Enhancement

Keywords: reflection removal, occlusion removal, image and video decomposition

1 Introduction

Many imaging conditions are far from optimal, forcing us to take our photos through reflecting or occluding elements. For example, when taking pictures through glass windows, reflections from indoor objects can obstruct the outdoor scene we wish to capture (Figure 1, top row). Similarly, to take pictures of animals in the zoo, we may need to shoot through an enclosure or a fence (Figure 1, bottom row). Such visual obstructions are often impossible to avoid just by changing the camera position or the plane of focus, and state-of-the-art computational approaches are still not robust enough to remove such obstructions from images with ease. More professional solutions, such as polarized lenses (for reflection removal), which may alleviate some of those limitations, are not accessible to the everyday user.

In this paper, we present a robust algorithm that allows a user to take photos through obstructing layers such as windows and fences, producing images of the desired scene as if the obstructing elements were not there. Our algorithm only requires the user to generate some camera motion during the imaging process, while the rest of the processing is fully automatic.

We exploit the fact that reflecting or obstructing planes are usually situated in between the camera and the main scene, and as a result, have a different depth than the main scene we want to capture.
Thus, instead of taking a single picture, we instruct the photographer to take a short image sequence while slightly moving the camera, an interaction similar to taking a panorama (with camera motion perpendicular to the z-axis being more desired than rotation).

Based on differences in the layers' motions due to visual parallax, our algorithm then integrates the space-time information and produces two images: an image of the background, and an image of the reflected or occluding content (Figure 1). Our setup imposes some constraints on the scene, such as being roughly static while capturing the images, but we find that many common shooting conditions fit it well.

The use of motion parallax for layer decomposition is not new. Rather, our paper's main contribution is a more robust and reliable algorithm for motion estimation in the presence of obstructions. Its success comes mainly from: (i) a pixel-wise flow field motion representation for each layer, which, in contrast to many previous image decomposition algorithms that use parametric motion models (affine or homography), is able to handle depth variation as well as small motions within each layer; and (ii) an "edge flow" method that produces a robust initial estimation of the motion of each layer in the presence of visual obstructions, as edges are less affected by the blending of the two layers. Given an input image sequence, we first initialize our algorithm by estimating sparse motion fields on image edges. We then interpolate the sparse edge flows into dense motion fields, and iteratively refine and alternate between computing the motions and estimating the background and obstruction layers in a coarse-to-fine manner (Figure 3).

Importantly, we also show that the two types of obstructions, reflections and physical occlusions (such as fences), can be handled by a single framework. Reflections and occlusions may appear different at a glance, and indeed, previous work has used different solutions to address each one. However, in this paper we present a unified approach that addresses the two problems from a single angle. With minimal tweaking, our system consists of largely shared modules for these two problems, while achieving results of higher quality than those produced by previous algorithms addressing either subproblem. In this paper we specify manually the type of obstruction present in the scene (reflective or occluding) to tweak the algorithm to each case.

We test our method in various natural and practical scenarios, such as shooting through fences, windows and other reflecting surfaces. For quantitative evaluation, instead of synthetically simulating obstructions by blending or compositing images (as commonly done in previous work), we design controlled experiments in which we capture real scenes with ground truth decompositions. Our algorithm is fully automatic, can work with standard phone cameras, and only requires the user to move the camera in a freeform manner to scan the scene. In our experiments, we found that 5 images taken along a small, approximately horizontal baseline of a few centimeters are usually enough to remove the obstructing layer.

2 Background

The problems of removing reflections and removing occlusions from images have been explored in the past under different setups. Here we review related work in those two areas.

Reflection Removal. Separating transmission and reflection in images has been widely studied, both for the purpose of direct decomposition (e.g. [Levin et al. 2002; Szeliski et al. 2000]), as well as in the context of other graphics and vision applications such as image based rendering [Sinha et al. 2012; Kopf et al. 2013] and stereo [Tsin et al. 2006].

Previous work on reflection removal can be grouped into three main categories.
In the first category are approaches that remove reflection from a single image. As this problem is highly ill-posed, researchers have proposed different priors to make the problem more constrained. For example, Levin et al. [2002] proposed to use image priors, such as statistics of derivative filters and corner detectors in natural scenes, to decompose the image. They later improved their algorithm using patch-based priors learned from an external database [Levin et al. 2004; Levin and Weiss 2007]. However, their method requires a large amount of user input, relying on the user to mark points on the background and reflected content, and does not work well in textured regions. Recently, Li and Brown [2014] proposed to separate the reflection using a single image focused on the background, under the assumption that the reflection in that case will be blurrier. Even with these priors, single image reflection removal is extremely challenging and hard to make practical for real images.

The second line of work focuses on removing reflections from a set of images taken through polarizers. Using a polarized filter with different orientations, a sequence of images is captured, each of which is a linear combination of the background and reflection, $I_i = a_i I_B + b_i I_R$, where the combination coefficients $a_i$ and $b_i$ depend on the direction of the polarized filters. This set of images is then used to decompose the background and reflection layers, again using different priors on the two layers [Kong et al. 2014; Singh 2003; Sarel and Irani 2004; Farid and Adelson 1999]. These methods perform well, but the requirement of a polarized filter and two images from the same position limits their usefulness.

The third approach for removing reflections is to process an input image sequence where the background and reflection are moving differently. In this setup, both the intensity and the motion of each layer need to be recovered. To simplify the problem, most previous approaches in that category constrained the motion of each layer to follow some parametric model. Be'ery et al. [2008] assumed translational motion, and proposed an algorithm to decompose the sequence using a parameterized joint diagonalization. Gai et al. [2009] assume that the motion of each layer follows an affine transformation, and proposed an image prior based on joint patterns of both background and reflectance gradients. Guo et al. [2014] assume that the motion of each layer is a homography, and proposed a low-rank approximation formulation. For many practical scenarios, however, affine and perspective transformations cannot model the motions of the background and reflectance layers well enough. This is manifested as artifacts in the results.

Some authors used dense warp fields, as we do, to model the motions of the background and reflection. Szeliski et al. [2000] proposed a min/max alternation algorithm to recover the background and reflectance images, and used optical flow to recover a dense motion field for each layer. In addition to dense motion fields, our algorithm also incorporates image priors that were shown to be instrumental for removing reflections from image sequences [Gai et al. 2012], which are not used in [Szeliski et al. 2000]. Li and Brown [2013] extended that approach by replacing the optical flow with SIFT flow [Liu et al. 2008] to calculate the motion field. However, they model the reflection as an independent signal added to each frame, without utilizing the temporal consistency of the reflection layer, thus limiting the quality of their reconstructions.

Occlusion Removal. Occlusion removal is closely related to image and video inpainting [Criminisi et al. 2004; Bertalmio et al. 2000; Bertalmio et al. 2001; Newson et al. 2014].
To remove an object from an image or a video, the user first marks some regions to be removed, and then the inpainting algorithm removes those regions and fills in the holes by propagating information from other parts of the image or from other frames in the sequence. Image and video inpainting is mainly focused on interactive object removal, while our focus is on automatic removal, as in most of the cases we address asking the user to mark all the obstructed pixels in the image would be too laborious and impractical.

Several other papers proposed to automatically remove visual obstructions of particular types from either an image or a video. Several algorithms were proposed to remove near-regular structures, like fences, from an image or a video [Hays et al. 2006; Park et al. 2008; Park et al. 2011; Yamashita et al. 2010]. Mu et al. [2012] proposed to detect fence patterns based on visual parallax, and to remove them by pulling the occluded content from other frames. [Barnum et al. 2010] proposed to detect snow and raindrops in frequency space. [Garg and Nayar 2004] remove rain drops based on their physical properties. All these works either focus on particular types of obstacles (e.g. raindrops), or rely on visual properties of the obstruction, such as being comprised of repeating patterns. Our goal is to develop a general-purpose algorithm that can handle common obstructions, without relying on their specific properties.

3 Problem Setup

Figure 2 shows our image formation model. A camera is imaging a scene through a visual obstruction, which can be either a reflective object, such as glass, or an opaque object, such as a fence. In order to remove the artifacts, either the obstruction itself or the reflection introduced by the obstruction, the user captures a sequence of images while moving the camera. In this paper we assume that both the obstruction and background objects remain roughly static during the capture. If the obstruction is opaque (a fence, for example), we further require that each pixel in the background scene be visible (not occluded by the obstruction) in at least one frame in the sequence, so that its information can be recovered. We also assume that the obstruction (or the reflected content) is not too close to the background, so that when moving the camera there is sufficient difference between the motions of the two layers.

In this paper, we use a lower-case letter $a$ to denote a scalar, a normal capital letter $A$ to denote a vector, and a bold capital letter $\mathbf{A}$ to denote a matrix. We denote the matrix product as $\mathbf{A}B$, where $\mathbf{A} \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^m$, and the element-wise product of two vectors as $A \circ B$, where $A, B \in \mathbb{R}^n$.

If there is no obstruction, we will get a clean image of the background, denoted as $I_B \in \mathbb{R}^n$ ($n$ is the number of pixels).
Now, due to the presence of obstructions, the captured image $I \in \mathbb{R}^n$ is a composition of the background image $I_B$ and an additional obstruction layer, $I_O$:

$$I = (1 - A) \circ I_O + A \circ I_B, \quad (1)$$

where $I_O \in \mathbb{R}^n$ is the obstruction layer we want to remove, $1 \in \mathbb{R}^n$ is a vector with all components equal to 1, and $A \in \mathbb{R}^n$ is an alpha blending mask, which assigns a blending factor to each pixel. Notice that $A$ multiplies (element-wise) the background image, not the foreground image (that is, $A = 1$ means a background pixel). This is a less conventional notation for alpha maps, but it will help simplify some of the math later on.

More specifically, if we are imaging through a reflective object, such as a clean window, the obstruction layer $I_O$ is the image of objects on the same side of the camera, as shown in Figure 2(a). In this case we assume the alpha blending mask $A$ is a constant, as reflective objects are usually homogeneous (i.e. glass as found in most windows typically reflects light similarly throughout it). This is a common assumption in the reflection separation literature [Szeliski et al. 2000; Li and Brown 2013; Guo et al. 2014]. If, on the other hand, we are imaging through a fence or other opaque objects, the obstruction layer $I_O$ is the opaque object itself, as shown in Figure 2(b). In that case, the alpha map, $A$, equals 1 in the regions where the background is not occluded, and is between 0 and 1 if the background is partially or fully occluded.

Figure 2: The image formation model. A camera images a desired scene through (a) a reflecting surface, and through (b) a partial obstruction.

Decomposing $I$ into the obstruction layer $I_O$ and the background layer $I_B$ is ill-posed from a single input image. We therefore ask the user to move the camera and take a sequence of images. Assuming the obstruction layer $I_O$ is relatively closer (further away) to the camera than the background objects, its projected motion on the image plane will be larger (smaller) than that of the background objects due to visual parallax. We utilize this difference in the motions to decompose the input image into the background and obstruction components.

More formally, given an input sequence, we pick one frame $t_0$ from the sequence as the reference frame, and estimate the background component $I_B$ and the obstruction component $I_O$ of that frame, using the information from other frames. Assuming both the background objects and the obstruction layer are static, we can express the background and obstruction components of other frames $t \neq t_0$ as warped versions of the respective components of the reference frame $t_0$. Specifically, let $V_t^O$ and $V_t^B$ denote the motion fields for the obstruction and background layers from the reference frame $t_0$ to the frame $t$, respectively. The observed image at time $t$ is then

$$I_t = (1 - W(V_t^O)A) \circ W(V_t^O) I_O + W(V_t^O)A \circ W(V_t^B) I_B, \quad (2)$$

where $W(V_t^B) \in \mathbb{R}^{n \times n}$ is a warping matrix such that $W(V_t^B) I_B$ is the background component $I_B$ warped according to the motion field $V_t^B$. Since the obstruction is between the camera and the background objects, the alpha map shares the same motion as the obstruction component $I_O$, not the background component $I_B$. We can therefore simplify the formulation by defining $I_O = (1 - A) \circ I_O$ (with an abuse of the notation $I_O$), to get the following, simplified equation:

$$I_t = W(V_t^O) I_O + W(V_t^O)A \circ W(V_t^B) I_B. \quad (3)$$

Note that in the case of reflection, since the alpha map is a constant, $A = \alpha$, the formulation can be further simplified as $I_t = W(V_t^O) I_O + W(V_t^B) I_B$, where (again abusing notation) $I_O = (1 - \alpha) I_O$ and $I_B = \alpha I_B$. That is, the alpha map is essentially absorbed into the reflection and background components.
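To make the formation model concrete, below is a minimal Python/NumPy sketch of Eq. (3). It assumes grayscale images stored as H x W arrays and dense motion fields stored as H x W x 2 pixel-displacement maps, and uses a nearest-neighbor backward warp as a crude stand-in for the warping operator W(V); the function names are ours, not the paper's implementation.

```python
import numpy as np

def warp(img, flow):
    """Backward-warp a grayscale image (H x W) by a dense motion field
    flow (H x W x 2, [dx, dy] per pixel): out(x) = img(x + flow(x)).
    Nearest-neighbor sampling keeps the sketch short; the paper's W(V)
    operator would use proper interpolation."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs2 = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    ys2 = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[ys2, xs2]

def compose_frame(I_O, I_B, A, V_tO, V_tB):
    """Synthesize the observed frame of Eq. (3):
    I_t = W(V_t^O) I_O + W(V_t^O) A * W(V_t^B) I_B."""
    return warp(I_O, V_tO) + warp(A, V_tO) * warp(I_B, V_tB)
```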
Except for a small modification in the optimization for the reflection case, which we will describe later on, both types of obstructions (opaque objects and reflective panes) are handled similarly by the algorithm.

Our goal is then to recover the background component $I_B$ and the obstruction component $I_O$ for the reference frame $I_{t_0}$, from an input image sequence $\{I_t\}$, without knowing the motion fields $V_t^B$ and $V_t^O$, or the alpha map $A$ (in the case of opaque occlusion).

Figure 3: Algorithm pipeline. Our algorithm consists of two steps: initialization and iterative optimization. Initialization: we first calculate the motion vectors on extracted edge pixels from the input images (we thicken the edge mask for better visualization). Then we fit two perspective transforms (one for each layer) to the edge motion and assign each edge pixel to either the background layer or the obstruction layer. This results in two sets of sparse flow fields for the two layers (top right), which we then interpolate to produce an initial estimation of the dense motion fields for each layer (bottom right). Optimization: in this stage, we alternate between updating the motion fields and updating the background and obstruction components, until convergence.

4 Motion-based Decomposition

4.1 Formulation

Let us now discuss the optimization problem for recovering the background and obstruction components, $I_B$ and $I_O$, from an input image sequence, $\{I_t\}$. We will first derive an algorithm for the more general case of an unknown, spatially varying alpha map, $A$, and then show a small simplification that can be used for reflection removal, where we assume the alpha map is constant.

According to the image formation model (Eq. 3), we set our data term to be:

$$\sum_t \left\| I_t - W(V_t^O) I_O - W(V_t^O)A \circ W(V_t^B) I_B \right\|_1, \quad (4)$$

where $\{V_t^O\}$ and $\{V_t^B\}$ are the sets of motion vectors for the obstruction and background components, respectively.

To reduce the ambiguity of the problem, we include additional constraints based on priors on both the decomposed images and their respective motion fields. First, because the obstruction and background components are natural images, we enforce a heavy-tailed distribution on their gradients [Levin and Weiss 2007], as

$$\|\nabla I_O\|_1 + \|\nabla I_B\|_1, \quad (5)$$

where $\nabla I_B$ are the gradients of the background component $I_B$.

We assume that the alpha map is generally smoother than a natural image (smooth transitions in the blending coefficients). We assume its gradients follow a Gaussian distribution and penalize their $\ell_2$-norm:

$$\|\nabla A\|_2^2. \quad (6)$$

We also assume that the background component and the obstruction component are independent. That is, if we observe a strong gradient in the input image, it most likely belongs either to the background component or to the obstruction component, but not to both. To enforce this gradient ownership prior, we penalize the product of the gradients of background and obstruction, as

$$L(I_O, I_B) = \sum_x \|\nabla I_O(x)\|^2 \, \|\nabla I_B(x)\|^2, \quad (7)$$

where $x$ is the spatial index and $\nabla I_B(x)$ is the gradient of image $I_B$ at position $x$.

Finally, as commonly done by optical flow algorithms [Black and Anandan 1996], we also enforce sparsity on the gradients of the motion fields, seeking to minimize

$$\sum_t \|\nabla V_t^O\|_1 + \|\nabla V_t^B\|_1. \quad (8)$$

Combining all the terms above, our objective function is:

$$\min_{I_O, I_B, A, \{V_t^O\}, \{V_t^B\}} \; \sum_t \left\| I_t - W(V_t^O) I_O - W(V_t^O)A \circ W(V_t^B) I_B \right\|_1 + \lambda_1 \|\nabla A\|_2^2 + \lambda_2 \left( \|\nabla I_O\|_1 + \|\nabla I_B\|_1 \right) + \lambda_3 L(I_O, I_B) + \lambda_4 \sum_t \left( \|\nabla V_t^O\|_1 + \|\nabla V_t^B\|_1 \right)$$
$$\text{subject to } 0 \le I_O, I_B, A \le 1, \quad (9)$$

where $\lambda_1, \dots, \lambda_4$ are weights for the different terms, which we tuned manually. For all the examples in the paper, we used $\lambda_1 = 1$, $\lambda_2 = 0.1$, $\lambda_3 = 3000$, and $\lambda_4 = 0.5$ [1]. Similarly to [Szeliski et al. 2000], we also constrain the intensities of both layers to be in the range [0, 1].

In Figure 4, we demonstrate the contribution of different components of our algorithm to the result, using a controlled sequence with ground truth decomposition. One of the main differences between our formulation and previous work on reflection removal is the use of dense motion fields instead of parametric motion. For comparison, we replaced the dense motion fields $V_t^O$ and $V_t^B$ in our formulation with a homography (the results using affine motion were similar). As can be seen in Figure 4(c,e), a dense motion representation greatly improves the quality and reduces artifacts. That is because the background/obstruction layer at different frames cannot be aligned well enough using parametric motion. Such misalignments show up as blur and artifacts in the decomposition. The decomposition quality also degrades slightly when not using the gradient sparsity prior, as shown in Figure 4(d).

[1] $L(I_B, I_O)$ is significantly smaller than the other terms, and so we compensate for that by choosing a larger $\lambda_3$.

Figure 4: The contribution of different components in the algorithm to the result. We use one of our controlled image sequences with ground truth decomposition (see Section 5), and compare our algorithm (e) with the following variants: (b) replacing the edge flow initialization with regular optical flow, (c) replacing the dense motion fields with parametric motion (we used projective transforms for both layers), and (d) removing the sparsity prior on the image gradients (i.e. $\lambda_2 = 0$ in Eq. 9). One pair of ground truth background and reflection images for this 5-frame sequence is shown in (a) for reference. The normalized cross correlation (see Section 5 for details) between each recovered layer and the ground truth is shown at the bottom left of each image.
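As a rough illustration of how the terms of Eq. (9) fit together, the sketch below evaluates the objective for a candidate solution. It assumes grayscale H x W arrays, takes the already-warped quantities W(V_t^O) I_O, W(V_t^O) A and W(V_t^B) I_B as inputs, and uses forward differences for the gradients; the weights follow the values quoted above. It is meant to clarify the energy, not to reproduce the authors' solver.

```python
import numpy as np

def grads(img):
    """Forward-difference gradients (gx, gy) of a 2D array."""
    gx = np.diff(img, axis=1, append=img[:, -1:])
    gy = np.diff(img, axis=0, append=img[-1:, :])
    return gx, gy

def tv_l1(img):
    """l1 norm of the image gradients, ||grad I||_1."""
    gx, gy = grads(img)
    return np.abs(gx).sum() + np.abs(gy).sum()

def objective(frames, warped_O, warped_A, warped_B, I_O, I_B, A, flows,
              lams=(1.0, 0.1, 3000.0, 0.5)):
    """Value of Eq. (9). warped_O[t], warped_A[t], warped_B[t] hold
    W(V_t^O) I_O, W(V_t^O) A and W(V_t^B) I_B; flows is the list of all
    motion fields (H x W x 2 arrays)."""
    l1, l2, l3, l4 = lams
    data = sum(np.abs(It - wO - wA * wB).sum()                      # Eq. (4)
               for It, wO, wA, wB in zip(frames, warped_O, warped_A, warped_B))
    gxA, gyA = grads(A)
    alpha_smooth = l1 * (gxA**2 + gyA**2).sum()                     # Eq. (6)
    sparsity = l2 * (tv_l1(I_O) + tv_l1(I_B))                       # Eq. (5)
    gO = sum(g**2 for g in grads(I_O))
    gB = sum(g**2 for g in grads(I_B))
    ownership = l3 * (gO * gB).sum()                                # Eq. (7)
    flow_smooth = l4 * sum(tv_l1(V[..., 0]) + tv_l1(V[..., 1])      # Eq. (8)
                           for V in flows)
    return data + alpha_smooth + sparsity + ownership + flow_smooth
```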

4.2 Optimization

We use an alternating gradient descent method to solve Eq. 9. We first fix the motion fields $\{V_t^O\}$ and $\{V_t^B\}$ and solve for $I_O$, $I_B$ and $A$, and then fix $I_O$, $I_B$ and $A$, and solve for $\{V_t^O\}$ and $\{V_t^B\}$. A similar alternating gradient descent approach for joint estimation has been used in video super resolution [Liu and Sun 2014].

Decomposition step: fix the motion fields $\{V_t^O\}$ and $\{V_t^B\}$, and solve for $I_O$, $I_B$, and $A$. In this step, we ignore all the terms in Eq. 9 that only consist of $V_t^O$ and $V_t^B$:

$$\min_{I_O, I_B, A} \; \sum_t \left\| I_t - W_t^O I_O - W_t^O A \circ W_t^B I_B \right\|_1 + \lambda_1 \|\nabla A\|_2^2 + \lambda_2 \left( \|\nabla I_O\|_1 + \|\nabla I_B\|_1 \right) + \lambda_3 L(I_O, I_B), \quad (10)$$
$$\text{subject to } 0 \le I_O, I_B, A \le 1,$$

where we use $W_t^O$ and $W_t^B$ as shorthand for $W(V_t^O)$ and $W(V_t^B)$. We solve this problem using a modified version of iteratively reweighted least squares (IRLS). The original IRLS algorithm is designed for an unconstrained optimization with only $\ell_1$- and $\ell_2$-norms. To get this form, we linearize the higher-order terms in the objective function in Eq. 10. Let $\tilde{I}_O$, $\tilde{I}_B$ and $\tilde{A}$ be the obstruction component, the background component, and the alpha map of the last iteration, respectively. Then the data term is linearized as [2]

$$\left\| I_t - W_t^O I_O - W_t^O A \circ W_t^B \tilde{I}_B - W_t^O \tilde{A} \circ W_t^B I_B + W_t^O \tilde{A} \circ W_t^B \tilde{I}_B \right\|_1. \quad (11)$$

We also linearize the gradient ownership term as [3]

$$\lambda_3 \left( L(I_O, \tilde{I}_B) + L(\tilde{I}_O, I_B) - L(\tilde{I}_O, \tilde{I}_B) \right). \quad (12)$$

[2] We make an approximation commonly used in optimization: $xy \approx x\tilde{y} + \tilde{x}y - \tilde{x}\tilde{y}$, where $\tilde{x}$ is very close to $x$ and $\tilde{y}$ is very close to $y$.

[3] $L(I_O, I_B) = \sum_x \|\nabla I_O\|^2 \|\nabla I_B\|^2 \approx \sum_x \|\nabla I_O\|^2 \|\nabla \tilde{I}_B\|^2 + \|\nabla \tilde{I}_O\|^2 \|\nabla I_B\|^2 - \|\nabla \tilde{I}_O\|^2 \|\nabla \tilde{I}_B\|^2 = L(I_O, \tilde{I}_B) + L(\tilde{I}_O, I_B) - L(\tilde{I}_O, \tilde{I}_B)$.

Second, we incorporate the two inequality constraints into our objective function using the penalty method [Luenberger 1973]. For example, for the non-negativity constraint $I_B \ge 0$, we include the following penalty function in the objective function:

$$\lambda_p \min(0, I_B)^2, \quad (13)$$

where the square is element-wise and $\lambda_p$ is the weight for the penalty (we fix $\lambda_p = 10^5$). This function applies a penalty proportional to the negativity of $I_B$ (and is zero if $I_B$ is nonnegative).
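The following is a generic, self-contained illustration of the two ingredients just described, not the authors' solver: an IRLS loop for an l1 objective of the form min_x ||Mx - b||_1, and the negativity penalty of Eq. (13). In the actual decomposition step, M would stack the (linearized) warping and gradient operators acting on I_O, I_B and A.

```python
import numpy as np

def irls_l1(M, b, n_iters=20, eps=1e-4):
    """IRLS for min_x ||Mx - b||_1: each iteration solves a weighted
    least-squares problem with weights 1 / (|residual| + eps)."""
    x = np.linalg.lstsq(M, b, rcond=None)[0]
    for _ in range(n_iters):
        w = 1.0 / (np.abs(M @ x - b) + eps)      # reweighting from the current residual
        Mw = M * w[:, None]                       # rows of M scaled by the weights
        x = np.linalg.solve(M.T @ Mw, Mw.T @ b)   # weighted normal equations
    return x

def negativity_penalty(I_B, lam_p=1e5):
    """Penalty of Eq. (13): proportional to the squared negative part of I_B,
    zero wherever I_B is already nonnegative."""
    return lam_p * (np.minimum(0.0, I_B) ** 2).sum()
```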

Motion estimation step: fix $I_O$, $I_B$, $A$, and solve for the motion fields $V_t^O$ and $V_t^B$. In this step, we ignore the terms in Eq. 9 that depend only on $I_O$, $I_B$, and $A$:

$$\min_{V_t^O, V_t^B} \; \left\| I_t - W(V_t^O) I_O - W(V_t^O)A \circ W(V_t^B) I_B \right\|_1 + \lambda_4 \left( \|\nabla V_t^O\|_1 + \|\nabla V_t^B\|_1 \right). \quad (14)$$

This equation can again be solved using IRLS, similarly to the decomposition step.

Multi-scale Processing. To accelerate the algorithm, we optimize across multiple scales. We build a Gaussian pyramid for the input image sequence, and first solve all the unknowns (the motions, background and obstruction components, and the alpha blending mask) at the coarsest level. We then propagate the coarse solution to the next level using standard bicubic interpolation, and use it as initialization to solve for all the unknowns at the finer level. We get the final solution by solving the problem at the original resolution. At each level, we make a few iterations between solving for the motion and solving for the decomposition. The final algorithm is summarized in Algorithm 1. The functions Decompose and EstimateMotion are the two steps described above. Scale 1 is the coarsest scale and ns is the finest scale (we use 3 to 4 scales), and ni is the number of iterations, which varies across scales. We typically use 4 iterations for the coarsest scale and 1 iteration for each of the other scales.

Data: {It}, initial guess of IO, IB, A, {VtO}, and {VtB}
Result: IO, IB, A, {VtO} and {VtB}
for scale s = 1 to ns do
    {It} <- downsample input image sequence {It} to scale s
    IO, IB, A, {VtO}, {VtB} <- downsample/upsample IO, IB, A, {VtO}, and {VtB} to scale s
    for i = 1 to ni do
        IO, IB, A <- Decompose({It}, {VtO}, {VtB})
        {VtO}, {VtB} <- EstimateMotion({It}, IO, IB, A)
    end
end
Algorithm 1: The motion-based decomposition algorithm.
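For orientation, here is a Python skeleton of the coarse-to-fine loop in Algorithm 1, assuming the two solvers are passed in as functions. The pyramid is reduced to naive decimation and the resampling of the current estimates between scales is omitted, so this is a structural sketch under those assumptions, not the actual implementation.

```python
def coarse_to_fine(frames, decompose, estimate_motion, state, n_scales=4):
    """Skeleton of Algorithm 1. `decompose` and `estimate_motion` stand for the
    IRLS solvers of Eqs. (10) and (14); `state` holds the current estimates of
    I_O, I_B, A and the motion fields (initialized as in Section 4.3).
    `frames` is a list of 2D NumPy arrays."""
    for s in range(1, n_scales + 1):
        step = 2 ** (n_scales - s)
        frames_s = [f[::step, ::step] for f in frames]   # crude stand-in for a Gaussian pyramid
        n_iters = 4 if s == 1 else 1                     # more iterations at the coarsest scale
        for _ in range(n_iters):
            state = decompose(frames_s, state)           # update I_O, I_B, A (Eq. 10)
            state = estimate_motion(frames_s, state)     # update the motion fields (Eq. 14)
        # (resampling of `state` to the next, finer scale omitted in this sketch)
    return state
```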

Reflection Removal. As discussed earlier, in the case of a reflective pane, the alpha map $A$ is essentially absorbed into the background and reflection images and there is no need to solve for it separately. We thus remove the prior term $\|\nabla A\|_2^2$ (Eq. 6) from the objective function, and only solve for $I_O$, $I_B$, $\{V_t^O\}$, and $\{V_t^B\}$. The data term in Eq. 9 becomes:

$$\left\| I_t - W(V_t^O) I_O - W(V_t^B) I_B \right\|_1, \quad (15)$$

and we use the same alternating gradient descent method described above to solve the decomposition. Currently we distinguish between the two sources of obstruction (opaque and reflective) manually.

4.3 Initialization

A key stage in our algorithm is the initialization of the motion fields and decomposition for the optimization. That part is vital for getting a clean separation, as the objective function in Eq. 9 is nonlinear, and the IRLS algorithms (Section 4.2) may get stuck at a local minimum. Our initialization works by first estimating an initial motion field for each layer, then calculating the initial decomposition (background component $I_B$, obstruction component $I_O$, and alpha map $A$) from the initial motion fields. We will now describe these two steps in detail.

Initial motion estimation. Motion estimation in videos with reflection/occlusion is challenging. For videos with reflection, for example, each pixel has two motion vectors, one for the background and one for the reflectance. Therefore, we cannot directly use optical flow to estimate the motion fields. Notice that previous work on multi-layer optical flow [Jepson and Black 1993; Jojic and Frey 2001; Weiss and Adelson 1996; Liu et al. 2014] usually assumes that the layer in the front occludes the layer in the back, while we assume the captured image is an additive superposition of two layers, so that both layers may be visible at the same location in the image.

Therefore, we propose to get an initial estimate of the motion fields using an "edge flow" algorithm. That is, we estimate a sparse motion field at each edge pixel identified in the image. As discussed before (Eq. 7), an observed image gradient will often belong to only one of the layers, either the background or the occlusion, but not to both.
Indeed, we find that motion vectors estimated from pixels with large image gradients are generally more robust. A similar idea was used by [Kopf et al. 2013] for multi-view stereo.

More specifically, for a given input sequence, we first extract the edge map for each frame using the Canny edge detector [Canny 1986]. Then we calculate the motion of the detected edge pixels by solving a discrete Markov random field (MRF):

$$\min_V \; \sum_{x \in \mathrm{Edge}(I_1)} -\mathrm{NCC}\big(I_1(x), I_2(x + V(x))\big) + \sum_{x, x' \in \mathrm{Edge}(I_1) \text{ and } (x, x') \in \mathcal{N}} S\big(V(x), V(x')\big), \quad (16)$$

where $I_1$ and $I_2$ are two neighboring input images, $V$ is the motion field from image $I_1$ to $I_2$ that we want to estimate, $\mathrm{Edge}(I_1)$ is the set of edge pixels in image $I_1$, and $\mathcal{N}$ is the 4-connected pixel neighborhood. Notice that, different from the motion fields described in previous sections, this motion field is only defined on image edges. We also assume that $V$ takes only integer values, so that $x + V(x)$ is also on the grid in the image $I_2$.

The first term in Eq. 16 is the data term that describes how well the patch located at position $x$ in image $I_1$ matches the patch located at $x + V(x)$ in image $I_2$. Here, $\mathrm{NCC}(I_1(x), I_2(x + V(x)))$ is the normalized cross correlation (NCC) between these two patches. The second term $S(V(x), V(x'))$ in Eq. 16 is the smoothness term that enforces neighboring edge pixels to have similar motion. We use the same penalty function described in Eq. 1 of [Kopf et al. 2013] for the smoothness term. We solve this Markov random field using belief propagation.
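As a small illustration of the data term in Eq. (16), the sketch below scores one candidate integer motion for one edge pixel by patch NCC; the patch size and the omitted border handling are our own simplifications.

```python
import numpy as np

def ncc(patch1, patch2, eps=1e-8):
    """Normalized cross correlation between two equal-sized patches."""
    a = patch1 - patch1.mean()
    b = patch2 - patch2.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def edge_flow_data_cost(I1, I2, x, y, v, half=7):
    """MRF data cost for assigning the integer motion v = (vx, vy) to the edge
    pixel (x, y): minus the NCC between the patch around (x, y) in I1 and the
    patch around (x + vx, y + vy) in I2. Patches are (2*half+1)^2 pixels;
    out-of-bounds checks are omitted for brevity."""
    p1 = I1[y - half:y + half + 1, x - half:x + half + 1]
    p2 = I2[y + v[1] - half:y + v[1] + half + 1,
            x + v[0] - half:x + v[0] + half + 1]
    return -ncc(p1, p2)
```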

After obtaining the sparse motion field using edge flow, we separate it into two sparse motion fields, one for each layer. For this we first fit a perspective transformation to the sparse motion field using RANSAC, and assign all the edge pixels that best fit this transformation to the background layer, assuming the background pixels are more dominant. We then fit another perspective transformation to the rest of the edge pixels (again using RANSAC), and assign the pixels best fitting the second transformation to the reflection layer. Finally, we compute the dense motion fields for both layers using visual surface interpolation [Szeliski 1990]. An illustration of this process is shown in Figure 3. In Figure 4 we compare the result of the proposed edge flow initialization (Figure 4(e)) with the one produced when initializing using standard optical flow, as done in [Szeliski et al. 2000] (Figure 4(b)).

Initial decomposition. To get an initial estimation of the decomposition, we first warp all the frames to the reference frame according to the background motion estimated in the previous step. In this warped input sequence, the background pattern should be roughly aligned, while the obstruction component should be moving (as the motions of the two components are assumed to be different).

In the case of an opaque occlusion, we take the per-pixel mean across the warped input frames as an initial estimation of the background image. We also compute a binary alpha map by thresholding the per-pixel difference between the estimated background image and the input images. If the difference is larger than a threshold (we used 0.1), we set the alpha map to 0 at that location; otherwise we set the alpha map to 1. We then get the obstruction component by plugging the initial estimates of $I_B$ and $A$ into Eq. 3. For a reflective pane, we take the initial estimation of the background image to be the minimum intensity across the warped frames [4].

[4] We compute the minimum intensity instead of the mean, since the minimum is an upper bound for the background's intensity. See [Szeliski et al. 2000] for details.
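A minimal sketch of this initial decomposition is given below. It assumes the frames have already been warped to the reference frame by the background motion; computing the alpha map from the difference between the mean background estimate and the reference frame is one plausible reading of the text above, not a confirmed detail.

```python
import numpy as np

def initial_decomposition(warped_frames, ref_idx=0, occlusion=True, thresh=0.1):
    """Initial background / alpha estimate from frames warped to the reference
    frame (Section 4.3). Opaque occlusion: per-pixel mean plus a binary alpha
    map from thresholding. Reflective pane: per-pixel minimum, which is an
    upper bound on the background intensity."""
    stack = np.stack(warped_frames, axis=0)
    if occlusion:
        I_B = stack.mean(axis=0)
        A = (np.abs(warped_frames[ref_idx] - I_B) <= thresh).astype(float)
        return I_B, A
    return stack.min(axis=0), None
```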
Figure 5: Reflection removal results on four natural sequences. For each sequence we show a representative frame (each input sequence in this figure contains 5 frames), and the background and reflection images recovered automatically by the algorithm. Corresponding close-up views are shown next to the images (on the right of each image for the sequences in the top and middle rows, and below each image for the sequences in the bottom row). More results can be found in the supplementary material.

5 Results

In this paper, we took most of the image sequences using the cell phone cameras of an HTC One M8 and a Samsung Galaxy 4, except for the sequence shown in Figure 5(c), which was taken using a Canon VIXIA HF G30. We processed the sequences on a desktop computer with an Intel Xeon CPU (8 cores) and 64GB of memory. With a non-optimized MATLAB implementation, processing a high-resolution image sequence (1152x648) took 20 minutes and required 3GB of RAM, and processing a low-resolution sequence (480x270) took 2 minutes and required 1GB of memory. We also created a non-optimized Windows Phone app prototype, implemented in C++, which produces equivalent results on the phone in less than 2 minutes for low-resolution images (480x270). Most of the sequences contain 5 frames sampled uniformly from the video, except for "gallery" (Figure 8, left), which contains 7 frames.

Removing Obstructions in Natural Sequences. We tested our algorithm under various scenarios, with different background objects, reflecting/occluding elements, and lighting conditions, and it worked consistently well in all these cases.

The top row in Figure 1 shows a common scenario in which a photographer is taking a picture of an outside view through a window, while a self-reflection appears in the captured images. The strong reflection of the shirt covers most of the image and obstructs a large part of the scene. Our algorithm generates a clean separation of the background and reflective components. Notice how the checkerboard pattern on the shirt is completely removed in the recovered background image, while most of the background textures, like the trees and the building, are well preserved.

Figure 5 shows more scenarios where reflections frequently appear in photos. One common case is imaging reflective surfaces, such as a glass-covered billboard at a bus station (Figure 5(a)) and a glass-covered panel (Figure 5(d)). Notice that in Figure 5(a), due to the reflection, many letters on the billboard are difficult to recognize. Even with such strong and textured reflection, our algorithm is able to produce a good separation, and words on the billboard become much clearer after the reflection is removed.

Reflections are also common when imaging outdoor scenes through windows during the night (Figure 5(b)) or at dusk (Figure 5(c)). Our algorithm is able to separate the background image from the reflection, producing a clean image of the outdoor scene, as well as revealing a lot of information about the indoor scene that is difficult to see in the original image. While the recovered background image is usually of most interest in this work, our high-quality reconstruction of the reflected scene may also be useful in cases where more information needs to be extracted from an image sequence or a video.

Figure 6: Occlusion removal results. For each sequence (row), we show a representative image from the input sequence (left column), the recovered background scene (second column) and the recovered occluding layer (third column). In the right column, we also show the alpha map, A, as inferred by the algorithm, with colors ranging from black (occlusion) to white (background). The alpha map, as expected, is tightly correlated with the occlusion image, but we show it here for completeness.

Figure 1 (bottom row) and Figure 6 show some common scenarios in which photographs are taken through more opaque, occluding elements, such as fences, dirty/textured windows (simulated by the SIGGRAPH logo), and surfaces covered by raindrops. In all cases, our algorithm is able to produce a good reconstruction of the background scene with the occluding content removed.

Figure 7: Comparison with Video De-Fencing [Mu et al. 2012] on sequences from their paper. Left column: two representative frames from each input sequence. Middle column: backgrounds recovered by [Mu et al. 2012]. Right column: backgrounds recovered by our method.

Comparison. We compared our algorithm with two state-of-the-art algorithms for removing reflections, [Guo et al. 2014] and [Li and Brown 2013], and with the recent work of [Mu et al. 2012] for fence removal from videos.
In Figure 8 we show side-by-side comparisons of our results with the results of [Guo et al. 2014] and [Li and Brown 2013] on two sequences. On the "gallery" sequence [5], both [Guo et al. 2014] and our algorithm manage to produce a clean background, while there are noticeable artifacts near the boundary of the image in the results of [Li and Brown 2013]. Moreover, the reflection image produced by our algorithm is cleaner than the ones produced by the other methods. On "night", our algorithm generated a much cleaner separation than the other methods. Notice, for example, how the garbage bin is still visible in the results by the other methods, while our algorithm produces a clean separation in that region.

[5] Original sequence by [Sinha et al. 2012]. We sampled 7 consecutive frames from their video.

Figure 8: Comparison with recent methods for reflection removal. More comparisons can be found in the supplementary material.

To compare with [Mu et al. 2012], we used the authors' own sequences "Tennis" and "Square", shown in Figure 7. On the "Tennis" sequence, the two methods produced comparable results, although our algorithm generated a slightly cleaner background image. On the "Square" sequence, since the direction of camera motion is mostly horizontal with only a small vertical component (see the input frames of "Square"), removing the horizontal part of the fence is challenging. Our algorithm can take advantage of this tiny vertical camera motion and remove both the horizontal and vertical parts of the fence, while in the result by [Mu et al. 2012] only the vertical part of the fence is removed.

Quantitative Evaluation. To evaluate our results quantitatively, we took three sequences with ground truth background and obstructing layers, as shown in Figure 9. The first two sequences were taken through glass (reflective layer), while the third sequence was taken through an iron net (occluding layer). We captured a sequence of images over time while moving the camera, where at each time step we captured three images: a standard (composite) image through the obstacle (Figure 9, left column), an image of just the background scene, captured by removing the obstacle, and an image of the reflective/occluding layer, which we captured by placing a black sheet of paper behind the obstacle, blocking the background component (and turning the glass into a mirror).
In this case we can use the camera motion that isalready introduced by the user for creating the panorama, to alsoremove the reections from it. We remove the reection for eachcaptured image using our algorithm as described above, and thenstitchtheresultstogethertoproduceareection-freepanoramaimage, as shown in Figure 10(b).Our approach will not work if the camera motion is purely rota-tional. In that case, we can still produce reection-free panoramas ifInput (rep. frame) Recovered background Recoverd obstructionNCC = 0.9921 NCC = 0.7079NCC = 0.9738 NCC = 0.8433NCC = 0.8985 NCC = 0.7536StoneToyHanoi`````````MethodSequence Stone ToyIBIOIBIO[Li and Brown 2013] 0.9271 0.2423 0.7906 0.6084[Guo et al. 2014] 0.9682 0.5662 0.7701 0.6860Ours 0.9738 0.8433 0.8985 0.7536Figure 9: Quantitative evaluation on controlled sequences. Top:foreachsequence(row), weshowarepresentativeframefromthecontrolledsequence(leftcolumn)andourdecompositionre-sult (middle and right columns). The normalized cross correlation(NCC)betweeneachrecoveredlayerandthegroundtruth(notshown, but available on the project web page) is written below theimage. Bottom:numerical comparison with recent techniques forreectionremoval(visualcomparisonscanbefoundonthewebpage).(a) Normal panorama processing(b) Reflection-free panoramaFigure 10: Reection-free panoramas. Often when taking panoramas of outdoor scenes through windows, reections of the indoor sceneon the window cannot be avoided. Our algorithm can be used to produce reection-free panoramas from the same camera motion used tocapture the panoramasi.e. without any additional work needed from the user. (a) The panorama produced with a mobile phone and astate-of-the-art stitching software, where indoor reection are very apparent. (b) Our reection-free panorama result. A panorama stitchingof the estimated reection is shown in the inset. On the right are close-up views of corresponding patches in the two panorama images.the user additionally translates the camera. However, we found thatin many practical scenariosa user taking a panorama with a hand-heldcameraoraphonethecameramotionwill not bepurelyrotational, and will introduce sufcient parallax for the algorithmto work.6 ConclusionIn this paper, we have demonstrated that high quality images can becaptured automatically, with mobile cameras, through various typesof common visual obstructions such as reections, fences, stains,and rain-drop covered windows. Our algorithm can dramaticallyincreasethequalityofphotostakeninsuchscenarios, andonlyrequires that the user record a short image sequence while mov-ing the cameraan interaction similar to taking a panorama. Ouralgorithm then combines the visual information across the imagesequence to produce a clean image of the desired background scenewiththevisualobstructionremoved. Wehaveshownpromisingexperimental results on various real and challenging examples, anddemonstrated signicant improvement in quality compared to priorwork.AcknowledgementsWe thank Dr. Rick Szeliski for useful discussions and feedback,and the anonymous SIGGRAPH reviewers for their comments.TianfanXue is supportedbyShell ResearchandONRMURI6923196.ReferencesBARNUM, P. C., NARASIMHAN, S., ANDKANADE, T. 2010.Analysisofrainandsnowinfrequencyspace. InternationalJournal of Computer Vision (IJCV) 86, 2-3, 256274.BE,E.,YEREDOR,A., AND MEMBER,S. 2008. Blind Separa-tion of Superimposed Shifted Images Using Parameterized JointDiagonalization. 
BERTALMIO, M., SAPIRO, G., CASELLES, V., AND BALLESTER, C. 2000. Image inpainting. In Computer Graphics and Interactive Techniques.

BERTALMIO, M., BERTOZZI, A. L., AND SAPIRO, G. 2001. Navier-Stokes, fluid dynamics, and image and video inpainting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

BLACK, M. J., AND ANANDAN, P. 1996. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding 63, 1, 75-104.

CANNY, J. 1986. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 6, 679-698.

CRIMINISI, A., PEREZ, P., AND TOYAMA, K. 2004. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing 13, 9, 1200-1212.

FARID, H., AND ADELSON, E. H. 1999. Separating reflections and lighting using independent components analysis. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

GAI, K., SHI, Z., AND ZHANG, C. 2009. Blind separation of superimposed images with unknown motions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

GAI, K., SHI, Z., AND ZHANG, C. 2012. Blind separation of superimposed moving images using image statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 34, 1, 19-32.

GARG, K., AND NAYAR, S. K. 2004. Detection and removal of rain from videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

GUO, X., CAO, X., AND MA, Y. 2014. Robust separation of reflection from multiple images. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

HAYS, J., LEORDEANU, M., EFROS, A. A., AND LIU, Y. 2006. Discovering texture regularity as a higher-order correspondence problem. In European Conference on Computer Vision (ECCV).

JEPSON, A., AND BLACK, M. J. 1993. Mixture models for optical flow computation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

JOJIC, N., AND FREY, B. J. 2001. Learning flexible sprites in video layers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

KONG, N., TAI, Y.-W., AND SHIN, J. S. 2014. A physically-based approach to reflection separation: from physical modeling to constrained optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 36, 2, 209-221.

KOPF, J., LANGGUTH, F., SCHARSTEIN, D., SZELISKI, R., AND GOESELE, M. 2013. Image-based rendering in the gradient domain. ACM SIGGRAPH.

LEVIN, A., AND WEISS, Y. 2007. User assisted separation of reflections from a single image using a sparsity prior. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29, 9, 1647-1654.

LEVIN, A., ZOMET, A., AND WEISS, Y. 2002. Learning to perceive transparency from the statistics of natural scenes. Advances in Neural Information Processing Systems (NIPS).

LEVIN, A., ZOMET, A., AND WEISS, Y. 2004. Separating reflections from a single image using local features. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

LI, Y., AND BROWN, M. S. 2013. Exploiting reflection change for automatic reflection removal. IEEE International Conference on Computer Vision (ICCV).

LI, Y., AND BROWN, M. S. 2014. Single image layer separation using relative smoothness. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

LIU, C., AND SUN, D. 2014. On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 36, 2, 346-360.
LIU, C., YUEN, J., TORRALBA, A., SIVIC, J., AND FREEMAN, W. T. 2008. SIFT flow: Dense correspondence across different scenes. In European Conference on Computer Vision (ECCV).

LIU, S., YUAN, L., TAN, P., AND SUN, J. 2014. SteadyFlow: Spatially smooth optical flow for video stabilization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

LUENBERGER, D. G. 1973. Introduction to Linear and Nonlinear Programming, vol. 28. Addison-Wesley, Reading, MA.

MU, Y., LIU, W., AND YAN, S. 2012. Video de-fencing. IEEE Circuits and Systems Society.

NEWSON, A., ALMANSA, A., FRADET, M., GOUSSEAU, Y., PEREZ, P., ET AL. 2014. Video inpainting of complex scenes. Journal on Imaging Sciences, Society for Industrial and Applied Mathematics.

PARK, M., COLLINS, R. T., AND LIU, Y. 2008. Deformed lattice discovery via efficient mean-shift belief propagation. In European Conference on Computer Vision (ECCV).

PARK, M., BROCKLEHURST, K., COLLINS, R. T., AND LIU, Y. 2011. Image de-fencing revisited. In Asian Conference on Computer Vision (ACCV).

SAREL, B., AND IRANI, M. 2004. Separating transparent layers through layer information exchange. European Conference on Computer Vision (ECCV).

SINGH, M. 2003. Computing layered surface representations: An algorithm for detecting and separating transparent overlays. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

SINHA, S., KOPF, J., GOESELE, M., SCHARSTEIN, D., AND SZELISKI, R. 2012. Image-based rendering for scenes with reflections. ACM SIGGRAPH.

SZELISKI, R., AVIDAN, S., AND ANANDAN, P. 2000. Layer extraction from multiple images containing reflections and transparency. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

SZELISKI, R. 1990. Fast surface interpolation using hierarchical basis functions. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 12, 6, 513-528.

TSIN, Y., KANG, S. B., AND SZELISKI, R. 2006. Stereo matching with linear superposition of layers. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 28, 2, 290-301.

WEISS, Y., AND ADELSON, E. H. 1996. A unified mixture framework for motion segmentation: Incorporating spatial coherence and estimating the number of models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

YAMASHITA, A., MATSUI, A., AND KANEKO, T. 2010. Fence removal from multi-focus images. In International Conference on Pattern Recognition (ICPR).