
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2014

Complex Background Subtraction by Pursuing Dynamic Spatio-Temporal Models

Liang Lin, Yuanlu Xu, Xiaodan Liang, Jianhuang Lai

Abstract—Although it has been widely studied in video surveillance, background subtraction remains an open problem in complex scenarios, e.g., dynamic backgrounds, illumination variations, and indistinct foreground objects. To address these challenges, we propose an effective background subtraction method that learns and maintains an array of dynamic texture models within spatio-temporal representations. At each location of the scene, we extract a sequence of regular video bricks, i.e., video volumes spanning both the spatial and temporal domains. Background modeling is thus posed as pursuing subspaces within the video bricks while adapting to scene variations. For each sequence of video bricks, we pursue the subspace with an ARMA (Auto-Regressive Moving Average) model that jointly characterizes the appearance consistency and temporal coherence of the observations. During online processing, we incrementally update the subspaces to cope with disturbances from foreground objects and scene changes. In the experiments, we validate the proposed method in several complex scenarios and show superior performance over other state-of-the-art background subtraction approaches. Empirical studies of parameter settings and component analysis are presented as well.

Index Terms—Background modeling; visual surveillance; spatio-temporal representation.

I. INTRODUCTION

Background subtraction (also referred to as foreground extraction) has been extensively studied for decades [1], [2], [3], [4], [5], [6], yet it remains an open problem in real surveillance applications due to the following challenges:

• Dynamic backgrounds. A scene environment is not always static but sometimes highly dynamic, e.g., rippling water, heavy rain, and camera jitter.

• Lighting and illumination variations, particularly with sudden changes.

• Indistinct foreground objects with appearances similar to the surrounding background.

This work was supported by the Hi-Tech Research and Development (863) Program of China (no. 2012AA011504), National Natural Science Foundation of China (no. 61173082, no. 61173084), Guangdong Science and Technology Program (no. 2012B031500006), Guangdong Natural Science Foundation (no. S2013050014548), Special Project on Integration of Industry, Education and Research of Guangdong Province (no. 2012B091000101), and Fundamental Research Funds for the Central Universities (no. 13lgjc26). Copyright (c) 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].

L. Lin is with the Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-Sen University), Ministry of Education, China, with the School of Advanced Computing, Sun Yat-Sen University, Guangzhou 510006, P. R. China, and with the SYSU-CMU Shunde International Joint Research Institute, Shunde, China. E-mail: [email protected]

Y. Xu, X. Liang, and J. Lai are with the School of Information Science and Technology, Sun Yat-Sen University, Guangzhou 510006, P. R. China.

Fig. 1. Some challenging scenarios for foreground object extraction handled by our approach: (i) a floating bottle on randomly dynamic water (left column), (ii) waving curtains around a person (middle column), and (iii) sudden lighting changes (right column).

In this paper, we address the above-mentioned difficulties by building background models through the online pursuit of spatio-temporal models. Some results generated by our system for these challenging scenarios are shown in Fig. 1. Before presenting the proposed approach, we first review existing work in the literature.

A. Related Work

Because background subtraction pervades various applications, there is no unique categorization of existing work. Here we introduce the related methods mainly according to their representations, so as to distinguish them from our approach.

The pixel-processing approaches modeled the observed scene as a set of independent pixel processes, and they were widely applied in video surveillance [7], [6]. In these methods [1], [8], [9], [2], each pixel in the scene is described by a parametric distribution (e.g., a Gaussian Mixture Model) that adapts temporally to environment changes. The parametric models, however, were not always compatible with real complex data, as they were defined upon underlying assumptions. To overcome this problem, non-parametric estimations [10], [11], [12], [13] were proposed and effectively improved the robustness. For example, Barnich et al. [13] presented a sample-based classification model that maintained a fixed number of samples for each pixel and classified a new observation as background when it matched a predefined number of samples. Liao et al. [14] recently employed the kernel density estimation (KDE) technique to capture pixel-level variations. Some distinct scene variations, i.e., illumination changes and shadows, can be explicitly alleviated by introducing extra estimations [15]. Guyon et al. [16] proposed to utilize low-rank matrix decomposition for background modeling, where the foreground objects constitute correlated sparse outliers. Despite acknowledged successes, this category of approaches may have limitations in complex scenarios, as the pixel-wise representations overlook the spatial correlations between pixels.

The region-based methods built background models by taking advantage of inter-pixel relations, demonstrating impressive results on handling dynamic scenes. A batch of diverse approaches were proposed to model the spatial structures of scenes, such as joint distributions of neighboring pixels [11], [17], block-wise classifiers [18], structured adjacency graphs [19], auto-regression models [20], [21], random fields [22], and multi-layer models [23]. A number of fast learning algorithms were also discussed to maintain these models online, accounting for environment variations or structural changes. For example, Monnet et al. [20] trained and updated the region-based model by generative subspace learning. Cheng et al. [19] employed the generalized 1-SVM algorithm for model learning and foreground prediction. In general, methods in this category separated the spatial and temporal information, and their performances were somewhat limited in highly dynamic scenarios, e.g., heavy rain or sudden illumination changes.

The third category modeled scene backgrounds by exploiting both spatial and temporal information. Mahadevan et al. [24] proposed to separate foreground objects from their surroundings by identifying distinctive video patches, i.e., patches whose motions and appearances differ from the majority of the scene. Zhao et al. [25] addressed outdoor night background modeling by performing subspace learning within video patches. Spatio-temporal representations were also extensively discussed in other vision tasks such as action recognition [26] and trajectory parsing [27]. These methods motivated us to build models upon spatio-temporal representations, i.e., video bricks.

In addition, several saliency-based approaches provided alternative solutions based on spatio-temporal saliency estimation [28], [24], [29]. Moving objects can be extracted according to their salient appearances and/or motions against the scene background. For example, Wixson [28] detected salient objects according to their consistent moving directions over time. Kim et al. [30] used a discriminant center-surround hypothesis to extract foreground objects from their surroundings.

Along with the above-mentioned background models, a number of reliable image features were utilized to better handle background noise [31]. Examples include the Local Binary Pattern (LBP) features [32], [33], [34] and color texture histograms [35]. The LBP operators describe each pixel by the relative gray levels of its neighboring pixels, and their effectiveness has been demonstrated in several vision tasks such as face recognition and object detection [32], [36], [37]. The Center-Symmetric LBP was proposed in [34] to further improve computational efficiency. Tan and Triggs [33] extended LBP to LTP (Local Ternary Pattern) by thresholding the gray-level differences with a small value, to enhance effectiveness on flat image regions.

B. Overview

In this work, we propose to learn and maintain dynamic models within spatio-temporal video patches (i.e., video bricks), accounting for the real challenges in surveillance scenarios [7]. The algorithm can process 15 ∼ 20 frames per second on average at a resolution of 352 × 288 pixels. We briefly overview the proposed background modeling framework in the following aspects.

1. Spatio-temporal representations. We represent the observed scene by video bricks, i.e., video volumes spanning both the spatial and temporal domains, in order to jointly model spatial and temporal information. Specifically, at every location of the scene, a sequence of video bricks is extracted as the observations, within which we learn and update the background models. Moreover, to compactly encode the video bricks against illumination variations, we design a brick-based descriptor, namely the Center-Symmetric Spatio-Temporal Local Ternary Pattern (CS-STLTP), which is inspired by the 2D scale-invariant local pattern operator proposed in [14]. Its effectiveness is validated in the experiments.

2. Pursuing dynamic subspaces. We treat each sequence of video bricks at a certain location as a consecutive signal, and generate the subspace within these video bricks. A linear dynamic system (i.e., the Auto-Regressive Moving Average, ARMA, model [38]) is adopted to characterize the spatio-temporal statistics of the subspace. Specifically, given the observed video bricks, we express them as a data matrix in which each column contains the feature of one video brick. The basis vectors (i.e., eigenvectors) of the matrix can then be estimated analytically, representing the appearance parameters of the subspace, and the parameters of the dynamical variations are further computed based on the fixed appearance parameters. It is worth mentioning that our background model jointly captures appearance and motion information, as the data (i.e., features of the video bricks) are extracted over both the spatial and temporal domains.

3. Maintaining dynamic subspaces online. Given newly appearing video bricks, moving foreground objects are segmented by estimating the residuals within the related subspaces of the scene, while the background models are maintained simultaneously to account for scene changes. The arising problem is to update the parameters of the subspaces incrementally against disturbance from foreground objects and background noise. A new observation may include noisy pixels (i.e., outliers), which degrade model updating [25], [20]. Furthermore, in our representation one video brick could be partially occluded by foreground objects, i.e., only some of the pixels in the brick are valid background observations. To overcome this problem, we present a novel approach that compensates observations (i.e., the observed video bricks) by generating data from the current models. Specifically, we replace the pixels labeled as non-background with generated pixels to synthesize the new observations. The algorithm for online model updating includes two steps: (i) update the appearance parameters using an incremental subspace learning technique, and (ii) update the dynamical variation parameters by analytically solving the linear reconstruction. The experiments show that the proposed method effectively improves robustness during online processing.

Fig. 2. An example of computing the CS-STLTP feature. For one pixel in a video brick, we construct four spatio-temporal planes. The center-symmetric local ternary pattern for each plane is calculated by comparing intensities in center-symmetric directions with a contrast threshold τ. The CS-STLTP feature is obtained by concatenating the vectors of the four planes.

The remainder of this paper is arranged as follows. We first present the model representation in Section II, and then discuss the initial learning, foreground segmentation, and online updating mechanism in Section III. The experiments and comparisons are demonstrated in Section IV, and Section V concludes the paper with a summary.

II. DYNAMIC SPATIO-TEMPORAL MODEL

In this section, we introduce the background of our model, and then discuss the video brick representation and the model definition, respectively.

A. Background

In general, a complex surveillance background may include diverse appearances that move and change dynamically and randomly over time [39]. There is a branch of work on time-varying texture modeling [40], [41], [42] in computer vision. These methods often treated the scene as a whole and pursued a global subspace by utilizing a linear dynamic system (LDS). They worked well on some natural scenes consisting mostly of a few homogeneous textures, as the LDS characterizes the subspace with a set of linearly combined components. However, under real surveillance challenges, it could be intractable to pursue a global subspace. In this work, we represent the observed scene by an array of small and independent subspaces, each of which is defined by a linear system, so that our model can better handle challenging scene variations. Our background model can thus be viewed as a mixed compositional model consisting of linear subspaces. In particular, we conduct background subtraction with our model based on the following observations.

Assumption 1: The local scene variations (i.e., appearance and motion changes over time) can be captured by a low-dimensional subspace.

Assumption 2: It is feasible to separate moving foreground objects from the scene background by fully exploiting spatio-temporal statistics.

B. Spatio-temporal Video Brick

Given the surveillance video of a scene, we first decompose it into a batch of small brick-like volumes. We consider that a video brick of small size (e.g., 4 × 4 × 5 pixels) contains relatively simple content, which can thus be generated by a few bases (components). Moreover, the brick volume integrates both spatial and temporal information, so that we can better capture complex appearance and motion variations compared with traditional image patch representations.

We divide each frame I_i (i = 1, 2, . . . , n) into a set of image patches of width w and height h. A number t of patches at the same location across consecutive frames are stacked together to form a brick. In this way, we extract a sequence of video bricks V = {v_1, v_2, . . . , v_n} at every location of the scene.
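As a concrete illustration of this decomposition, the following minimal sketch (our own, assuming a grayscale clip stored as a NumPy array; the function name and the 4 × 4 × 5 brick dimensions are illustrative) slices a clip into non-overlapping video bricks:

```python
import numpy as np

def extract_bricks(frames, w=4, h=4, t=5):
    """Split a clip of shape (T, H, W) into non-overlapping w x h x t bricks.

    Returns an array indexed by (brick_row, brick_col, brick_time), where
    each entry is a flattened brick of length w * h * t.
    """
    T, H, W = frames.shape
    T, H, W = (T // t) * t, (H // h) * h, (W // w) * w   # crop to multiples
    clip = frames[:T, :H, :W]
    bricks = clip.reshape(T // t, t, H // h, h, W // w, w)
    bricks = bricks.transpose(2, 4, 0, 1, 3, 5)          # rows, cols, time, ...
    return bricks.reshape(H // h, W // w, T // t, -1)

# Example: a random 50-frame clip at 288 x 352 resolution
clip = np.random.rand(50, 288, 352)
bricks = extract_bricks(clip)
print(bricks.shape)  # (72, 88, 10, 80): each location yields a sequence of bricks
```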

Moreover, we design a novel descriptor to describe each video brick instead of using raw RGB values. For a video brick v_i, we first apply the CS-STLTP operator to each pixel and pool all the feature values into a histogram. For a pixel x_c, we construct a few 2D spatio-temporal planes centered at it, and compute the local ternary pattern (LTP) operator [33] on each plane. CS-STLTP then encodes x_c by combining the LTP operators of all planes. Note that the way of splitting the spatio-temporal planes has little effect on the operator's performance. To simplify the implementation, we make the planes parallel to the Y axis, as shown in Fig. 2.

We index the neighboring pixels of x by {0, . . . , M−1}; the operator response on the j-th plane is then calculated as

$$ z_j(x) = \biguplus_{m=0}^{M/2-1} s_\tau\big(p_m,\, p_{m+M/2}\big), \qquad (1) $$

where pixels m and m + M/2 are two center-symmetric neighbors of x, and p_m and p_{m+M/2} are their gray levels, respectively. The symbol \biguplus indicates stretching the elements into a vector. The function s_\tau is defined as follows:

$$ s_\tau\big(p_m,\, p_{m+M/2}\big) = \begin{cases} 1, & \text{if } p_m > (1+\tau)\, p_{m+M/2}, \\ -1, & \text{if } p_m < (1-\tau)\, p_{m+M/2}, \\ 0, & \text{otherwise}, \end{cases} \qquad (2) $$

where τ is a constant threshold controlling the comparison range. Suppose that we take M = 8 neighboring pixels for computing the operator on each spatio-temporal plane, and the number of planes is 4. The resulting CS-STLTP vector then contains M/2 × 4 = 16 bins. Fig. 2 illustrates an example of computing the CS-STLTP operator, where we apply the operator to one pixel on 4 spatio-temporal planes displayed in different colors (e.g., green, blue, purple, and orange).

Then we build a histogram for each video brick by accumulating the CS-STLTP responses of all of its pixels; this definition was previously proposed by Guo et al. [36]:

$$ H(k) = \sum_{x \in v_i} \sum_{j=1}^{4} \mathbb{1}\big(z_j(x), k\big), \quad k \in [0, K], \qquad (3) $$

where 1(a, b) is an indicator function, i.e., 1(a, b) = 1 only if a = b. To measure the operator response, we transform the binary vector of CS-STLTP into a uniform value, defined as the number of spatial transitions (bitwise changes), as discussed in [36]. For example, the pattern (i.e., the vector of 16 bins) 0000000000000000 has a value of 0 and 1000000000000000 a value of 1. In our implementation, we further quantize all possible values into 48 levels. To further improve the capability, we can compute histograms in each color channel and concatenate them together.

The proposed descriptor is computationally efficient and compact for describing a video brick. In addition, by introducing a tolerant comparison range in the LTP operator computation, it is robust to local spatio-temporal noise within that range.
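To make the computation above concrete, here is a small sketch in Python. It is not the authors' implementation: the helper names are ours, only the per-plane code of Eqs. (1)-(2) is shown, and the 48-level quantization of the uniform values is simplified to binning the raw transition counts.

```python
import numpy as np

def s_tau(p, q, tau=0.2):
    """Scale-invariant ternary comparison of Eq. (2)."""
    if p > (1 + tau) * q:
        return 1
    if p < (1 - tau) * q:
        return -1
    return 0

def cs_ternary_code(plane_neighbors, tau=0.2):
    """Center-symmetric ternary digits for one pixel on one plane (Eq. (1)).

    `plane_neighbors` holds the M gray levels sampled around the pixel on
    that spatio-temporal plane (M = 8 here); the M/2 center-symmetric pairs
    (m, m + M/2) give M/2 ternary digits. The full CS-STLTP pattern of a
    pixel concatenates the codes of its four planes (16 digits).
    """
    M = len(plane_neighbors)
    return [s_tau(plane_neighbors[m], plane_neighbors[m + M // 2], tau)
            for m in range(M // 2)]

def brick_histogram(patterns, n_bins=48):
    """Pool per-pixel patterns into a brick-level histogram (Eq. (3)).

    Each pattern is mapped to its number of transitions between consecutive
    digits (the 'uniform value' of the text) and accumulated into a fixed
    number of bins.
    """
    hist = np.zeros(n_bins)
    for code in patterns:
        transitions = sum(a != b for a, b in zip(code, code[1:]))
        hist[min(transitions, n_bins - 1)] += 1
    return hist / max(hist.sum(), 1.0)   # normalized histogram
```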

C. Model Definition

Let m be the descriptor length for each brick, and let V = {v_1, v_2, . . . , v_n}, v_i ∈ R^m, be a sequence of video bricks at a certain location of the observed background. We use a set of bases (components) C = [C_1, C_2, . . . , C_d] to represent the subspace in which V lies. Each video brick v_i in V can then be represented as

$$ v_i = \sum_{j=1}^{d} z_{i,j}\, C_j + \omega_i, \qquad (4) $$

where C_j is the j-th basis (the j-th column of matrix C) of the subspace, z_{i,j} is the coefficient for C_j, and ω_i is the appearance residual. We regard C as representing the appearance consistency of the sequence of video bricks. In traditional background models based on subspace learning, z_{i,j} is solved and kept constant, under the assumption that the appearance of the background is stable within the observations. In contrast, we treat z_{i,j} as a variable term that can be further phrased as a time-varying state, accounting for temporally coherent variations (i.e., the motions). For notational simplicity, we drop the subscript j and denote Z = {z_1, z_2, . . . , z_n} for all the bricks. The dynamic model is formulated as

$$ z_{i+1} = A z_i + \eta_i, \qquad (5) $$

where η_i is the state residual and A is a d × d matrix modeling the state variations. With this definition, we regard A as representing the temporal coherence among the observations.

Therefore, the problem of pursuing the dynamic subspace is posed as solving for the appearance consistency C and the temporal coherence A within the observations. Since the sequence of states Z is unknown, we jointly solve for C, A, and Z by minimizing an empirical energy function F_n(C, A, Z):

$$ \min\; F_n(C, A, Z) = \frac{1}{2n} \sum_{i=1}^{n} \Big( \|v_i - C z_i\|_2^2 + \|z_i - A z_{i-1}\|_2^2 \Big). \qquad (6) $$

Here F_n(C, A, Z) is not jointly convex, but it can be solved by alternately fixing either Z or (C, A). Nevertheless, its computational cost is too expensive for learning the entire background online. We therefore simplify the dynamic model in Equation (5) into a linear system, following the auto-regressive moving average (ARMA) process. In the literature, Soatto et al. [40] originally associated the output of the ARMA model with dynamic textures, and showed that the first-order ARMA model, driven by white zero-mean Gaussian noise, can capture a wide range of dynamic textures. In our approach, the difficulty of modeling the dynamic variations is alleviated by the brick-based representation, i.e., the observed scene is decomposed into video bricks. Thus, we consider the ARMA process a suitable model for the time-varying variables, and it can be solved efficiently. Specifically, we introduce a robustness term (i.e., a matrix) B, which includes d_ε bases, and we set η_i = B ε_i, where ε_i denotes the noise.

We further summarize the proposed dynamic model, adding the subscript n to the main components to indicate that they are estimated from n observations:

$$ \begin{aligned} v_i &= C_n z_i + \omega_i, \\ z_{i+1} &= A_n z_i + B_n \varepsilon_i, \\ \omega_i &\overset{\text{IID}}{\sim} N(0, \Sigma_\omega), \quad \varepsilon_i \overset{\text{IID}}{\sim} N(0, I_{d_\varepsilon}). \end{aligned} \qquad (7) $$

In this model, C_n ∈ R^{m×d} and A_n ∈ R^{d×d} represent the appearance consistency and temporal coherence, respectively. B_n ∈ R^{d×d_ε} is the robustness term constraining the evolution of Z over time. ω_i ∈ R^m is the residual corresponding to observation v_i, and ε_i ∈ R^{d_ε} is the noise of the state variations. During subspace learning, ω_i and ε_i are assumed to follow zero-mean Gaussian distributions. Given a new brick mapped into the subspace, ω_i and ε_i can be used to measure how well the observation fits the subspace, so we utilize them for foreground object detection during online processing.

The proposed model is time-varying, and the parameters C_n, A_n, B_n can be updated incrementally as new observations are processed, in order to adapt our model to scene changes.

III. LEARNING ALGORITHM

In this section, we discuss the learning of the spatio-temporal background models, including the initial subspace generation and the online maintenance. The initial learning is performed at the beginning of system deployment, when only a few foreground objects move in the scene. Afterwards, the system switches to the mode of online maintenance.

A. Initial Model Learning

In the initial stage, the model defined in Equation (7) degenerates into a non-dynamic linear system, as the n observations are extracted and fixed. Given a brick sequence V = {v_1, v_2, . . . , v_n}, we present an algorithm to identify the model parameters C_n, A_n, B_n, following the sub-optimal solution proposed in [40].

To guarantee that Equation (7) has a unique and canonical solution, we postulate

$$ n \gg d, \quad \mathrm{Rank}(C_n) = d, \quad C_n^\top C_n = I_d, \qquad (8) $$

where I_d is the d × d identity matrix. The appearance consistency term C_n can be estimated as

$$ C_n = \arg\min_{C_n} \big|\, W_n - C_n [\, z_1\; z_2\; \cdots\; z_n \,] \,\big|, \qquad (9) $$

where W_n is the data matrix composed of the observed video bricks [v_1, v_2, · · · , v_n]. Equation (9) satisfies the full-rank approximation property and can thus be solved by singular value decomposition (SVD). We have

$$ W_n = U \Sigma Q^\top, \quad U^\top U = I, \quad Q^\top Q = I, \qquad (10) $$

where Q is a unitary matrix, U contains the eigenvectors, and Σ is the diagonal matrix of the singular values. Thus, C_n is taken as the first d columns of U, and the state matrix [z_1 z_2 · · · z_n] as the product of the d × d sub-matrix of Σ and the first d rows of Q^⊤.

The temporal coherence term A_n is calculated by solving the following linear problem:

$$ A_n = \arg\min_{A_n} \big|\, [\, z_2\; z_3\; \cdots\; z_n \,] - A_n [\, z_1\; z_2\; \cdots\; z_{n-1} \,] \,\big|. \qquad (11) $$

The statistical robustness term B_n is estimated from the reconstruction error E,

$$ E = [\, z_2\; z_3\; \cdots\; z_n \,] - A_n [\, z_1\; z_2\; \cdots\; z_{n-1} \,] = B_n [\, \varepsilon_1\; \varepsilon_2\; \cdots\; \varepsilon_{n-1} \,], \qquad (12) $$

where B_n ≅ (1/√(n−1)) E. Since the rank of A_n is d and d ≪ n, the rank of the input-to-state noise, d_ε, is assumed to be much smaller than d. That is, the dimension of E can be further reduced by SVD: E = U_ε Σ_ε Q_ε^⊤, and we have

$$ B_n = \frac{1}{\sqrt{n-1}}\, \big[\, U_\varepsilon^{1} \; \cdots \; U_\varepsilon^{d_\varepsilon} \,\big]\, \mathrm{diag}\big( \Sigma_\varepsilon^{1}, \ldots, \Sigma_\varepsilon^{d_\varepsilon} \big). \qquad (13) $$

The values of d and d_ε essentially reflect the complexity of the subspace in terms of appearance consistency and temporal coherence, respectively. For example, video bricks containing static content can be well described with a few dimensions, while highly dynamic video bricks (e.g., from an active fountain) require more bases. In real surveillance scenarios, it is not practical to pre-determine the complexity of the scene environment. Hence, in the proposed method, we adaptively determine d and d_ε by thresholding the eigenvalues in Σ and Σ_ε, respectively,

$$ d^* = \arg\max_{d} \; \Sigma^{d} > T_d, \qquad d_\varepsilon^* = \arg\max_{d_\varepsilon} \; \Sigma_\varepsilon^{d_\varepsilon} > T_{d_\varepsilon}, \qquad (14) $$

where Σ^d denotes the d-th eigenvalue in Σ and Σ_ε^{d_ε} the d_ε-th eigenvalue in Σ_ε.
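As a rough illustration of this identification procedure, the following sketch (our own simplification, assuming the brick descriptors are stacked as columns of a NumPy matrix; the thresholding of Eq. (14) is applied directly to the singular values) estimates C_n, A_n, and B_n:

```python
import numpy as np

def initial_learning(W, T_d=0.5, T_de=0.5):
    """Identify (C_n, A_n, B_n) from a data matrix W of shape (m, n),
    whose columns are the brick descriptors, following Eqs. (9)-(14)."""
    U, S, Qt = np.linalg.svd(W, full_matrices=False)

    # Eq. (14): keep components whose singular values exceed the threshold
    d = max(int(np.sum(S > T_d)), 1)
    C = U[:, :d]                                  # appearance consistency, Eq. (9)
    Z = np.diag(S[:d]) @ Qt[:d, :]                # state sequence [z_1 ... z_n]

    # Eq. (11): least-squares fit of the state transition A_n
    Z0, Z1 = Z[:, :-1], Z[:, 1:]
    A = Z1 @ np.linalg.pinv(Z0)

    # Eqs. (12)-(13): robustness term from the reconstruction error
    E = Z1 - A @ Z0
    Ue, Se, _ = np.linalg.svd(E, full_matrices=False)
    de = max(int(np.sum(Se > T_de)), 1)
    B = (Ue[:, :de] * Se[:de]) / np.sqrt(W.shape[1] - 1)
    return C, A, B, Z
```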

B. Online Model Maintenance

We now discuss the online processing with our model, which segments moving foreground objects and keeps the model updated.

(I) Foreground segmentation. Given a newly appearing video brick v_{n+1}, we determine whether its pixels belong to the background by thresholding their appearance residual and state residual. We first estimate the state of v_{n+1} with the existing C_n,

$$ z'_{n+1} = C_n^\top v_{n+1}, \qquad (15) $$

and then the appearance residual of v_{n+1},

$$ \omega_{n+1} = v_{n+1} - C_n z'_{n+1}. \qquad (16) $$

As the state z_n and the temporal coherence A_n have already been solved, we can then estimate the state residual ε_n according to Equation (7),

$$ B_n \varepsilon_n = z'_{n+1} - A_n z_n \;\Rightarrow\; \varepsilon_n = \mathrm{pinv}(B_n)\,\big(z'_{n+1} - A_n z_n\big), \qquad (17) $$

where pinv denotes the pseudo-inverse operator. With the state residual ε_n and the appearance residual ω_{n+1} of the new video brick v_{n+1}, we apply the following criteria for foreground segmentation, in which two thresholds are introduced (see the sketch after this list):

1) v_{n+1} is classified as background only if all dimensions of ε_n are less than a threshold T_ε.

2) If v_{n+1} has been labeled as non-background, pixel-wise segmentation is performed by comparing ω_{n+1} with a threshold T_ω: a pixel is segmented as foreground if its corresponding dimension in ω_{n+1} is greater than T_ω.
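A minimal sketch of this two-level test, assuming for illustration that the brick descriptor is pixel-wise (e.g., raw intensity values) so that each dimension of ω_{n+1} maps back to a pixel; residual magnitudes are compared against the thresholds, and all names are ours:

```python
import numpy as np

def segment_brick(v_new, C, A, B, z_prev, T_eps=3.0, T_omega=5.0):
    """Two-level foreground test of Eqs. (15)-(17).

    Returns (is_background, pixel_mask), where pixel_mask marks the
    dimensions of v_new classified as foreground (empty if the whole
    brick passes the state-residual test).
    """
    z_new = C.T @ v_new                       # Eq. (15): project onto subspace
    omega = v_new - C @ z_new                 # Eq. (16): appearance residual
    eps = np.linalg.pinv(B) @ (z_new - A @ z_prev)   # Eq. (17): state residual

    if np.all(np.abs(eps) < T_eps):           # criterion 1: brick-level test
        return True, np.zeros_like(v_new, dtype=bool)

    pixel_mask = np.abs(omega) > T_omega      # criterion 2: pixel-level test
    return False, pixel_mask
```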

(II) Model updating. During online processing, the key problem for model updating is to deal with foreground disturbance, i.e., to avoid absorbing pixels from foreground objects or noise.

In this work, we develop an effective approach that updates the model with synthesized data. We first generate a video brick from the current model, namely the noise-free brick \hat{v}_{n+1}, as

$$ \hat{z}_{n+1} = A_n z_n, \qquad \hat{v}_{n+1} = C_n \hat{z}_{n+1}. \qquad (18) $$

Then we extract pixels from \hat{v}_{n+1} to compensate the occluded (i.e., foreground) pixels in the newly appearing brick. Concretely, the pixels labeled as non-background are replaced by the pixels of the noise-free video brick at the same positions. We thus obtain a synthesized video brick \tilde{v}_{n+1} for model updating.
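Continuing the illustrative functions above, the compensation step of Eq. (18) can be sketched as:

```python
import numpy as np

def synthesize_brick(v_new, pixel_mask, C, A, z_prev):
    """Replace foreground-labeled dimensions of the new brick with
    values predicted by the current model (Eq. (18))."""
    z_hat = A @ z_prev                               # predicted state
    v_hat = C @ z_hat                                # noise-free brick from the model
    v_tilde = np.where(pixel_mask, v_hat, v_new)     # compensate occluded pixels
    return v_tilde, z_hat
```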

Given the brick \tilde{v}_{n+1}, the data matrix W_n composed of the observed video bricks is extended to W_{n+1}. We then update the model C_{n+1} according to Equation (9).

Our model updating algorithm includes two steps: (i) update the appearance consistency parameters C_{n+1} by employing the incremental subspace learning technique, and (ii) update the state variation parameters A_{n+1}, B_{n+1}.

(i) Step 1. For the d-dimensional subspace with eigenvectors C_n and eigenvalues Λ_n, the covariance matrix Cov_n can be approximated as

$$ \mathrm{Cov}_n \approx \sum_{j=1}^{d} \lambda_{n,j}\, c_{n,j} c_{n,j}^\top = C_n \Lambda_n C_n^\top, \qquad (19) $$

where c_{n,j} and λ_{n,j} denote the j-th eigenvector and eigenvalue, respectively. With the newly synthesized data \tilde{v}_{n+1}, the updated covariance matrix Cov_{n+1} is formulated as

$$ \begin{aligned} \mathrm{Cov}_{n+1} &= (1-\alpha)\,\mathrm{Cov}_n + \alpha\, \tilde{v}_{n+1} \tilde{v}_{n+1}^\top \\ &\approx (1-\alpha)\, C_n \Lambda_n C_n^\top + \alpha\, \tilde{v}_{n+1} \tilde{v}_{n+1}^\top \\ &= \sum_{i=1}^{d} (1-\alpha)\,\lambda_{n,i}\, c_{n,i} c_{n,i}^\top + \alpha\, \tilde{v}_{n+1} \tilde{v}_{n+1}^\top, \end{aligned} \qquad (20) $$

where α denotes the learning rate. The covariance matrix can be further re-formulated to simplify computation as

$$ \mathrm{Cov}_{n+1} = Y_{n+1} Y_{n+1}^\top, \qquad (21) $$

where Y_{n+1} = [\, y_{n+1,1}\; y_{n+1,2}\; \ldots\; y_{n+1,d+1} \,] and each column y_{n+1,j} of Y_{n+1} is defined as

$$ y_{n+1,j} = \begin{cases} \sqrt{(1-\alpha)\,\lambda_{n,j}}\; c_{n,j}, & \text{if } 1 \le j \le d, \\ \sqrt{\alpha}\; \tilde{v}_{n+1}, & \text{if } j = d+1. \end{cases} \qquad (22) $$

To reduce the computational cost, we estimate C_{n+1} from the smaller matrix Y_{n+1}^\top Y_{n+1} instead of the original large matrix Cov_{n+1}:

$$ \big(Y_{n+1}^\top Y_{n+1}\big)\, e_{n+1,j} = \lambda_{n+1,j}\, e_{n+1,j}, \quad j = 1, 2, \ldots, d+1, \qquad (23) $$

where e_{n+1,j} and λ_{n+1,j} are the j-th eigenvector and eigenvalue of the matrix Y_{n+1}^\top Y_{n+1}, respectively. Letting c_{n+1,j} = Y_{n+1} e_{n+1,j}, we can rewrite Equation (23) as

$$ \begin{aligned} Y_{n+1} Y_{n+1}^\top Y_{n+1}\, e_{n+1,j} &= \lambda_{n+1,j}\, Y_{n+1}\, e_{n+1,j}, \\ \mathrm{Cov}_{n+1}\, c_{n+1,j} &= \lambda_{n+1,j}\, c_{n+1,j}, \quad j = 1, 2, \ldots, d+1. \end{aligned} \qquad (24) $$

We thus obtain the updated eigenvectors C_{n+1} and the corresponding eigenvalues Λ_{n+1} of the new covariance matrix Cov_{n+1}. Note that the dimension of the subspace automatically increases with the newly added data \tilde{v}_{n+1}. To keep the appearance parameters stable, we retain the main principal (i.e., top d) eigenvectors and eigenvalues while discarding the least significant components.
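The small-matrix trick of Eqs. (20)-(24) can be sketched as follows (our own simplification: eigenvalues are kept as a 1-D array, the recovered eigenvectors are explicitly re-normalized, and the robust reweighting described next is omitted here):

```python
import numpy as np

def incremental_update(C, lam, v_tilde, alpha=0.05):
    """Update the top-d eigenvectors C (m x d) and eigenvalues lam (d,)
    with one synthesized observation v_tilde, via Eqs. (20)-(24)."""
    d = C.shape[1]
    # Eq. (22): build Y_{n+1} from d reweighted eigenvectors plus the new sample
    Y = np.hstack([np.sqrt((1 - alpha) * lam) * C,
                   np.sqrt(alpha) * v_tilde[:, None]])
    # Eq. (23): eigen-decompose the small (d+1) x (d+1) matrix Y^T Y
    lam_new, E = np.linalg.eigh(Y.T @ Y)
    order = np.argsort(lam_new)[::-1]              # sort by decreasing eigenvalue
    lam_new, E = lam_new[order], E[:, order]
    # Eq. (24): map back to eigenvectors of Cov_{n+1}, then normalize columns
    C_new = Y @ E
    C_new /= np.linalg.norm(C_new, axis=0, keepdims=True)
    # Keep only the top-d components so the subspace dimension stays fixed
    return C_new[:, :d], lam_new[:d]
```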

The above incremental subspace learning algorithm has been widely applied in several vision tasks such as face recognition and image segmentation [43], [44], [45], and also to background modeling in [25], [46], [4]. However, noisy observations caused by moving objects or scene variations often disturb subspace maintenance, e.g., the eigenvectors may change dramatically during processing. Many efforts [47], [48] have been dedicated to improving the robustness of incremental learning by statistical analysis. Several discriminative learning algorithms [49] were also employed to train background classifiers that can be incrementally updated. In this work, we utilize a version of Robust Incremental PCA (RIPCA) [50] to cope with outliers in \tilde{v}_{n+1}. Note that \tilde{v}_{n+1} consists of pixels either from the generated data \hat{v}_{n+1} or from the real video, so outliers may exist in some dimensions.

In traditional PCA learning, the solution is derived by minimizing a least-squares reconstruction error,

$$ \min\; |r_{n+1}|^2 = \big| C_n C_n^\top \tilde{v}_{n+1} - \tilde{v}_{n+1} \big|^2. \qquad (25) $$

Following [50], we impose a robustness function w(t) = 1 / (1 + (t/ρ)^2) over each dimension of r_{n+1}, and the objective is re-defined as

$$ \min\; \sum_{k} w\big(r_{n+1}^{k}\big)\,\big(r_{n+1}^{k}\big)^2, \qquad (26) $$

where the superscript k indicates the k-th dimension. The parameter ρ in the robustness function is estimated as

$$ \rho = \big[\rho^{1}, \rho^{2}, \ldots, \rho^{|v_{n+1}|}\big]^\top, \qquad \rho^{k} = \max_{i=1,\ldots,d} \beta \sqrt{\lambda_{n,i}}\, \big|c_{n,i}^{k}\big|, \quad k = 1, 2, \ldots, |v_{n+1}|, \qquad (27) $$

where β is a fixed coefficient and c_{n,i}^{k} denotes the k-th element of the i-th eigenvector. The k-th dimension of ρ is thus proportional to the maximal projection of the current eigenvectors onto the k-th dimension (i.e., ρ^k is weighted by the corresponding eigenvalues). Note that w(r_{n+1}^k) is a function of the residual error and is calculated for each vector dimension; its computational cost is negligible given the analytical solution.


Fig. 3. An example demonstrating the robustness of model maintenance. In a scenario with dynamic water surfaces, we visualize the original and predicted intensities at a fixed position (denoted by the red star) with the blue and red curves, respectively. With our updating scheme, when the position is occluded by a foreground object from frame 551 to 632, the predicted intensities are not disturbed by the foreground, i.e., the model remains stable.

Accordingly, we update the observation \tilde{v}_{n+1} over each dimension by applying the function w(r_{n+1}^k),

$$ \tilde{v}_{n+1}^{k} = \sqrt{w\big(r_{n+1}^{k}\big)}\; \tilde{v}_{n+1}^{k}. \qquad (28) $$

That is, the reweighted \tilde{v}_{n+1} is treated as the new observation during the incremental learning procedure.
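A corresponding sketch of the robust reweighting of Eqs. (25)-(28); β follows the value used in the experiments, and variable names are illustrative. In an online loop, its output would be fed to the incremental update above in place of the raw synthesized brick:

```python
import numpy as np

def robust_reweight(C, lam, v_tilde, beta=2.3849):
    """Down-weight outlier dimensions of v_tilde before the PCA update."""
    r = C @ (C.T @ v_tilde) - v_tilde            # Eq. (25): reconstruction residual
    # Eq. (27): per-dimension scale from the strongest weighted eigenvector entry
    rho = beta * np.max(np.sqrt(lam) * np.abs(C), axis=1)
    w = 1.0 / (1.0 + (r / np.maximum(rho, 1e-12)) ** 2)   # robustness function
    return np.sqrt(w) * v_tilde                  # Eq. (28): reweighted observation
```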

(ii) Step 2. With C_{n+1} fixed, we then update the state variation parameters A_{n+1}, B_{n+1}. We first estimate the latest state z_{n+1} based on the updated C_{n+1} as

$$ z_{n+1} = C_{n+1}^\top \tilde{v}_{n+1}. \qquad (29) $$

A_{n+1} can then be calculated by re-solving the linear problem over a fixed number of the latest observed states,

$$ A_{n+1} [\, z_{n-l+1}\; \cdots\; z_n \,] = [\, z_{n-l+2}\; \cdots\; z_{n+1} \,], \qquad (30) $$

where l indicates the number of latest observed states, i.e., the span of observations. Similarly, we update B_{n+1} by computing the new reconstruction error E = [\, z_{n-l+2}\; \cdots\; z_{n+1} \,] - A_{n+1} [\, z_{n-l+1}\; \cdots\; z_n \,].
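Finally, a sketch of this second step (Eqs. (29)-(30)), reusing the conventions of the earlier sketches; the window length l = 60 is the value used in the experiments:

```python
import numpy as np

def update_state_params(C_new, v_tilde, Z_recent, T_de=0.5, l=60):
    """Update A and B from the latest l states after the appearance update."""
    z_new = C_new.T @ v_tilde                        # Eq. (29): newest state
    Z = np.hstack([Z_recent, z_new[:, None]])[:, -(l + 1):]
    Z0, Z1 = Z[:, :-1], Z[:, 1:]
    A_new = Z1 @ np.linalg.pinv(Z0)                  # Eq. (30)
    E = Z1 - A_new @ Z0                              # new reconstruction error
    Ue, Se, _ = np.linalg.svd(E, full_matrices=False)
    de = max(int(np.sum(Se > T_de)), 1)
    B_new = (Ue[:, :de] * Se[:de]) / np.sqrt(max(Z0.shape[1] - 1, 1))
    return A_new, B_new, Z, z_new
```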

We present an empirical study in Fig. 3 to demonstrate the effectiveness of this updating method. The video used for background modeling contains dynamic water surfaces. We visualize the original and predicted intensities at a fixed position (denoted by the red star) with the blue and red curves, respectively. We can observe that the model remains stable under foreground occlusion.

Time complexity analysis. We mainly employ SVD and linear programming in the initial learning. The time complexity of SVD is O(n³) and the learning time of linear programming is O(n²). For a certain location, the time complexity of the initial learning is therefore O(n³) + O(n²) = O(n³) for each subspace, where n denotes the number of video bricks used for model learning. For online learning, incremental subspace learning and linear programming are utilized. Given a d-dimensional subspace, the time complexity of component updating (i.e., step 1 of the model maintenance) is O(dn²). Thus, the total time complexity of online learning is O(dn²) + O(l²), where l is the number of states used to solve the linear problem.

We summarize the algorithm sketch of our framework in Algorithm 1.

Algorithm 1: The sketch of the proposed algorithm.
Input: Video brick sequence V = {v_1, v_2, . . . , v_n} for every location of the scene.
Output: Maintained background models and foreground regions.
forall the locations of the scene do
    Given the observed video bricks V, extract the CS-STLTP descriptors;
    Initialize the subspace by estimating C_n, A_n, B_n using Equations (8)-(14);
    for each newly appearing video brick v_{n+1} do
        (1) Extract the CS-STLTP descriptor for v_{n+1};
        (2) Calculate its state residual ε_n and appearance residual ω_{n+1} by Equations (16) and (17);
        (3) For each pixel of v_{n+1}, classify it as foreground or background by thresholding the two residuals ε_n and ω_{n+1};
        (4) Generate the noise-free brick \hat{v}_{n+1} from the current model by Equation (18);
        (5) Synthesize the video brick \tilde{v}_{n+1} for model updating;
        (6) Reweight \tilde{v}_{n+1} by applying the robustness function;
        (7) Update the appearance parameters C_{n+1} by calculating the covariance matrix Cov_{n+1} with learning rate α;
        (8) Update the state variation parameters A_{n+1}, B_{n+1};
    end
end

IV. EXPERIMENTS

In this section, we first introduce the datasets used in the experiments and the parameter settings, and then present the experimental results and comparisons. Discussions of the system components are given at the end.

A. Datasets and settings

We collect a number of challenging videos to validate our approach, which are publicly available or come from real surveillance systems. Two of them (AirportHall and TrainStation) from the PETS database¹ include crowded pedestrians and moving cast shadows; five highly dynamic scenes² include a waving curtain, an active fountain, swaying trees, and water surfaces; the others contain extremely difficult cases such as heavy rain and sudden and gradual lighting changes. Most of the videos include thousands of frames, and some of the frames are manually annotated as ground truth provided by the original databases.

¹Downloaded from http://www.cvg.rdg.ac.uk/slides/pets.html.
²Downloaded from http://perception.i2r.a-star.edu.sg

Our algorithm has been adopted in a real video surveillance system and achieves satisfactory performance. The system is capable of processing 15 ∼ 20 frames per second at a resolution of 352 × 288 pixels, on a desktop computer with an Intel i7 2600 (3.4 GHz) CPU and 8 GB RAM.

All parameters are fixed in the experiments, including the contrast threshold of the CS-STLTP descriptor τ = 0.2, the dimension thresholds of the ARMA model T_d = 0.5 and T_{d_ε} = 0.5, the span of observations for model updating l = 60, and the brick size 4 × 4 × 5. For foreground segmentation, the appearance residual threshold is T_ω = 3 and the state residual threshold T_ε = 3 (T_ω = 5 and T_ε = 4 when using RGB features). In the online model maintenance, the coefficient β = 2.3849 and the learning rate α = 0.05 for RIPCA.

In the experiments, we use the first 50 frames of each testing video to initialize our system (i.e., to perform the initial learning), and keep the model updated over the rest of the sequence. In addition, we apply a standard post-processing step to eliminate regions containing fewer than 20 pixels. All competing approaches are executed with the same settings as our approach.

We utilize the F-score as the benchmark metric, which measures segmentation accuracy by considering both recall and precision. The F-score is defined as

$$ F = \frac{2\,TP}{2\,TP + FP + FN}, \qquad (31) $$

where TP is the number of true positives (correctly detected foreground pixels), FN the number of false negatives (foreground pixels missed as background), and FP the number of false positives (background pixels falsely detected as foreground).
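For completeness, a tiny helper (not from the paper) that computes this metric from boolean foreground masks:

```python
import numpy as np

def f_score(pred_mask, gt_mask):
    """F-score of Eq. (31) for boolean foreground masks."""
    tp = np.sum(pred_mask & gt_mask)
    fp = np.sum(pred_mask & ~gt_mask)
    fn = np.sum(~pred_mask & gt_mask)
    return 2 * tp / max(2 * tp + fp + fn, 1)
```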

B. Experimental results

Experimental results. We compare the proposed method (STDM) with six state-of-the-art online background subtraction algorithms: the Gaussian Mixture Model (GMM) [1] as the baseline, the improved GMM [8]³, the online auto-regression model [20], the non-parametric model with scale-invariant local patterns [14], the discriminative model using a generalized Struct 1-SVM [19]⁴, and the Bayesian joint domain-range (JDR) model [11]⁵. In the comparisons, we use the released code for the methods [1], [8], [19], [11], and implement the methods [14], [20] ourselves. The F-scores (%) over all 10 videos are reported in Table I, where the last two columns report the results of our method using either RGB or CS-STLTP as the feature. Note that for the results using the RGB feature we represent each video brick by concatenating the RGB values of all its pixels. We also exhibit the results and comparisons using precision-recall (PR) curves, as shown in Fig. 4; due to space limitations, we only show results on 5 videos. From the results, we observe that the proposed method generally outperforms the other methods on most videos. For the scenes with highly dynamic backgrounds (e.g., scenes #2, #5, and #10), the improvements made by our method exceed 10%. The system also handles indistinct foreground objects well (i.e., small or background-like objects in scenes #1 and #3). Moreover, we obtain significant improvements (15% ∼ 25%) on scenes #6 and #7, which include both sudden and gradual lighting changes. A number of sampled background subtraction results are exhibited in Fig. 5.

³Available at http://dparks.wikidot.com/background-subtraction
⁴Available at http://www.cs.mun.ca/~gong/Pages/Research.html
⁵Available at http://www.cs.cmu.edu/~yaser/

Fig. 4. Experimental results generated by our approach and competing methods on 5 videos: first row left, a scene including a dynamic curtain and indistinct foreground objects (i.e., objects with appearance similar to the background); first row right, a scene with heavy rain; second row left, an indoor scene with sudden lighting changes; second row right, a scene with a dynamic water surface; third row, a busy airport. The precision-recall (PR) curve is used as the benchmark measurement for all 6 algorithms.

The benefit of using the proposed CS-STLTP feature is clearly validated by the results in Table I and Fig. 5. In general, our approach using only RGB values achieves satisfactory performance for common scenes, e.g., with moderate appearance and motion changes, while the CS-STLTP operator better handles highly dynamic variations (e.g., sudden illumination changes, rippling water). In addition, we also compare CS-STLTP with the existing scale-invariant descriptor SILTP proposed in [14]. We retain all settings of our approach except replacing the feature with SILTP, and achieve an average precision of 69.70% over all 10 videos. This result shows that CS-STLTP is very suitable and effective for the video brick representation.


Fig. 5. Sampled results of background subtraction generated by our approach (using RGB or CS-STLTP as the feature and RIPCA as the update strategy) and other competing methods. Columns: Airport, Floating Bottle, Waving Curtain, Lighting Changes, Water Surface; rows: input frame, ground truth, STDM (RGB), STDM (Ftr.), GMM, improved GMM, Online-AR, JDR, Struct 1-SVM, SILTP.


TABLE I. Quantitative results and comparisons on the 10 complex videos using the F-score (%) measurement. The last two columns report the results of our method using either RGB or CS-STLTP as the feature.

Scene                GMM [1]  Im-GMM [8]  Online-AR [20]  JDR [11]  SVM [19]  PKDE [14]  STDM (RGB)  STDM (Ftr.)
#1  Airport            46.99       47.36           62.72     60.23     65.35      68.14       70.52        66.40
#2  Floating Bottle    57.91       57.77           43.79     45.64     47.87      59.57       69.04        78.17
#3  Waving Curtain     62.75       74.58           77.86     72.72     77.34      78.01       79.74        74.93
#4  Active Fountain    52.77       60.11           70.41     68.53     74.94      76.33       76.85        85.46
#5  Heavy Rain         71.11       81.54           78.68     75.88     82.62      76.71       79.35        75.29
#6  Sudden Light       47.11       51.37           37.30     52.26     47.61      52.63       51.56        74.57
#7  Gradual Light      51.10       50.12           13.16     47.48     62.44      54.86       54.84        77.41
#8  Train Station      65.12       68.80           36.01     57.68     61.79      67.05       73.43        66.35
#9  Swaying Trees      19.51       23.25           63.54     45.61     24.38      42.54       43.71        75.89
#10 Water Surface      79.54       86.01           77.31     84.27     83.13      74.30       88.54        88.68
Average                55.39       59.56           57.02     60.23     59.79      63.08       68.75        76.31

Fig. 6. Discussion of parameter selection: (a) the learning rate α for model maintenance and (b) the contrast threshold τ of the CS-STLTP feature. In each plot, the horizontal axis represents the different parameter values; the three curves denote, respectively, the false positives (FP), false negatives (FN), and the sum of FP and FN.

Fig. 7. Empirical study of the video brick size in our approach. We carry out experiments on the 10 videos with different brick sizes while keeping the other settings unchanged. The vertical axis represents the average precision of background subtraction and the horizontal axis represents the different brick sizes used for background decomposition (from 1×1×3 up to 8×8×5).

C. Discussion

Furthermore, we conduct the following empirical studies to justify the parameter determinations and settings of our approach.

Efficiency. Like other online-learning background models, there is a trade-off between model stability and maintenance efficiency. The corresponding parameter in our method is the learning rate α. We tune α in the range 0 ∼ 0.3 while fixing the other model parameters and visualize the quantitative results of background subtraction, as shown in Fig. 6(a).

From the results, we observe that our model is insensitive to this parameter in the range 0 ∼ 0.1. In practice, once the scene is extremely busy and crowded, α can be set to a relatively small value to keep the model stable.

Feature effectiveness. The contrast threshold τ is the only parameter of the CS-STLTP operator, and it affects the power of the feature to characterize spatio-temporal information within video bricks. From the empirical results of parameter tuning, shown in Fig. 6(b), we observe that the appropriate range for τ is 0.15 ∼ 0.25. In practice, the model can become sensitive to noise when τ is set very small (say τ < 0.15), while too large a τ (say τ > 0.25) may reduce the accuracy of detecting foreground regions with homogeneous appearances.

Size of video brick. One may wonder how the system performance is affected by the size of the video brick used for background decomposition, so we present an empirical study on different brick sizes in Fig. 7. We observe that the best result is achieved with a brick size of 4 × 4 × 3, and the results with sizes 4 × 4 × 1 and 4 × 4 × 5 are also satisfactory. For very small bricks (e.g., 1 × 1 × 3), few spatio-temporal statistics are captured and the models may have problems handling scene variations. Bricks of large size (e.g., 8 × 8 × 5) carry too much information, and their subspaces cannot be effectively generated by the linear ARMA model. The experimental results are also in accordance with our motivations in Section I. In practice, we can flexibly set the size according to the resolution of the surveillance videos.

Model initialization. Our method is not sensitive to the number of observed frames in the initial stage of subspace generation. We test different numbers, say 30, 40, and 60, on two typical surveillance scenes, i.e., the Airport Hall (scene #1) and the Train Station (scene #8). The F-score outputs show that the deviations with different numbers of initial frames are very small, e.g., less than 0.2. In general, we require the observed scene to be relatively clean for initialization, although a few objects moving across it are allowed.

V. CONCLUSION

This paper presents an effective method for background subtraction, addressing the challenges in real surveillance scenarios. In this method, we learn and maintain dynamic texture models within spatio-temporal video patches (i.e., video bricks). Extensive experiments and empirical analysis are presented to validate the advantages of our method.

In the future, we plan to improve the method in two aspects. (1) Efficient tracking algorithms can be incorporated into the framework to better distinguish foreground objects. (2) A GPU-based implementation can be developed to process each part of the scene in parallel, which would probably improve the system efficiency significantly.

REFERENCES

[1] C. Stauffer and W.E.L. Grimson. Adaptive background mixture modelsfor real-time tracking. In Proceedings of IEEE Conference on ComputerVision and Pattern Recognition (CVPR), 1999.

[2] Thierry Bouwmans, Fida El Baf, Bertrand Vachon, et al. Backgroundmodeling using mixture of gaussians for foreground detection-a survey.Recent Patents on Computer Science, 1(3):219–237, 2008.

[3] L. Maddalena and A. Petrosino. A self-organizing approach to back-ground subtraction for visual surveillance applications. IEEE Transac-tions on Image Processing (TIP), 17(7):1168–1177, 2008.

[4] D.-M. Tsai and S.-C. Lai. Independent component analysis-basedbackground subtraction for indoor surveillance. IEEE Transactions onImage Processing (TIP), 18(1):158–167, 2009.

[5] H.J. Chang, H.J., and J.Y. Choi. Active attentional sampling for speed-up of background subtraction. In Proceedings of IEEE Conference onComputer Vision and Pattern Recognition (CVPR), 2012.

[6] X. Liu, L. Lin, S. Yan, H. Jin, and W. Tao. Integrating spatio-temporal context with multiview representation for object recognitionin visual surveillance. Circuits and Systems for Video Technology, IEEETransactions on, 21(4):393–407, 2011.

[7] L. Lin, Y. Lu, Y. Pan, and X. Chen. Integrating graph partitioning andmatching for trajectory analysis in video surveillance. Image Processing,IEEE Transactions on, 21(12):4844–4857, 2012.

[8] Z. Zivkovic. Improved adaptive gaussian mixture models for backgroundsubtraction. In Proceedings of IEEE International Conference on PatternRecognition (ICPR), 2004.

[9] D. Lee. Effective gaussian mixture learning for video backgroundsubtraction. IEEE Transactions on Pattern Analysis and MachineIntelligence (TPAMI), 27(5):827–832, 2005.

[10] A. Elgammal, R. Duraiswami, D. Harwood, and L.S. Davis. Backgroundand foreground modeling using nonparametric kernel density estimationfor visual surveillance. Proceedings of the IEEE, 90(7):1151–1163,2002.

[11] Y. Sheikh and M. Shah. Bayesian modeling of dynamic scenes forobject detection. IEEE Transactions on Pattern Analysis and MachineIntelligence (TPAMI), 27(11):1778–1792, 2005.

[12] C. Benedek and T. Sziranyi. Bayesian foreground and shadow detectionin uncertain frame rate surveillance videos. IEEE Transactions on ImageProcessing (TIP), 17(4):608–621, 2008.

[13] O. Barnich and M.V. Droogenbroeck. Vibe: A universal backgroundsubtraction algorithm for video sequences. IEEE Transactions on ImageProcessing (TIP), 20(6):1709–1724, 2011.

[14] S. Liao, G. Zhao, V. Kellokumpu, M. Pietikainen, and S.Z. Li. Mod-eling pixel process with scale invariant local patterns for backgroundsubtraction in complex scenes. In Proceedings of IEEE InternationalConference on Computer Vision and Pattern Recognition (CVPR), 2010.

[15] J. Pilet, C. Strecha, and P. Fua. Making background subtraction robustto sudden illumination changes. In Proceedings of European Conferenceon Computer Vision (ECCV), 2008.

[16] C. Guyon, T. Bouwmans, and E.-H. Zahzah. Foreground detectionvia robust low rank matrix decomposition including spatio-temporalconstraint. Computer Vision - ACCV 2012 Workshops, pages 315–320,2013.

[17] M. Wu and X. Peng. Spatio-temporal context for codebook-baseddynamic background subtraction. AEU - International Journal ofElectronics and Communications, 64(8):739–747, 2010.

[18] H.H. Lin, T.L. Liu, and J.H. Chuang. Learning a scene backgroundmodel via classification. IEEE Transactions on Signal Processing (TSP),57(5):1641–1654, 2009.

[19] L. Cheng and M. Gong. Realtime background subtraction from dynamicscenes. In Proceedings of IEEE International Conference on ComputerVision (ICCV), 2009.

[20] A. Monnet, A. Mittal, N. Paragios, and V. Ramesh. Backgroundmodeling and subtraction of dynamic scenes. In Proceedings of IEEEInternational Conference on Computer Vision (ICCV), 2003.

[21] L. Wixson. Segmenting foreground objects from a dynamic texturedbackground via a robust kalman filter. In Proceedings of IEEE Interna-tional Conference on Computer Vision (ICCV), 2003.

[22] Y. Wang, K.F. Loe, and J.K. Wu. A dynamic conditional random fieldmodel for foreground and shadow segmentation. IEEE Transactionson Pattern Analysis and Machine Intelligence (TPAMI), 28(2):279–289,2007.

[23] K.A. Patwardhan, G. Sapiro, and V. Morellas. Robust foregrounddetection in video using pixel layers. IEEE Transactions on PatternAnalysis and Machine Intelligence (TPAMI), 30(4):746–751, 2008.

[24] V. Mahadevan and N. Vasconcelos. Spatiotemporal saliency in dynamicscenes. IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI), 32(1):171–177, 2010.

[25] Y. Zhao, H. Gong, L. Lin, and Y. Jia. Spatio-temporal patches for night background modeling by subspace learning. In Proceedings of IEEE International Conference on Pattern Recognition (ICPR), pages 1–4, 2008.

[26] X. Liang, L. Lin, and L. Cao. Learning latent spatio-temporal compositional model for human action recognition. In Proceedings of ACM International Conference on Multimedia (ACM MM), 2013.

[27] X. Liu, L. Lin, and H. Jin. Contextualized trajectory parsing with spatio-temporal graph. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(12):3010–3024, 2013.

[28] L. Wixson. Detecting salient motion by accumulating directionally-consistent flow. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(8):774–780, 2000.

[29] D. Gutchess, M. Trajkovics, E. Cohen-Solal, D. Lyons, and A.K. Jain. A background model initialization algorithm for video surveillance. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2001.

[30] W. Kim, C. Jung, and C. Kim. Spatiotemporal saliency detection and its applications in static and dynamic scenes. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 21(4):446–456, 2011.

[31] L. Lin, X. Liu, and S.-C. Zhu. Layered graph matching with composite cluster sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(8):1426–1442, 2010.

[32] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 24(7):971–987, 2002.

[33] X. Tan and B. Triggs. Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Transactions on Image Processing (TIP), 19(6):168–182, 2007.

[34] M. Heikkila and M. Pietikainen. Description of interest regions with local binary patterns. Pattern Recognition (PR), 42(3):425–436, 2006.

[35] J. Yao and J. Odobez. Multi-layer background subtraction based on color and texture. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.

[36] Z. Guo, L. Zhang, and D. Zhang. Rotation invariant texture classification using LBP variance (LBPV) with global matching. Pattern Recognition, 43(3):706–719, 2010.

[37] L. Lin, P. Luo, X. Chen, and K. Zeng. Representing and recognizing objects with massive local image patches. Pattern Recognition, 45(1):231–240, 2012.

[38] E.J. Hannan and M. Deistler. Statistical theory of linear systems. Wiley Series in Probability and Mathematical Statistics, 1988.

[39] L. Lin, T. Wu, J. Porway, and Z. Xu. A stochastic graph grammar for compositional object representation and recognition. Pattern Recognition, 42(7):1297–1307, 2009.

[40] S. Soatto, G. Doretto, and Y.N. Wu. Dynamic textures. International Journal of Computer Vision (IJCV), 52(2):91–109, 2003.

[41] P. Saisan, G. Doretto, Y.N. Wu, and S. Soatto. Dynamic texture recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 58–63, 2001.

[42] G. Doretto, D. Cremers, P. Favaro, and S. Soatto. Dynamic texture segmentation. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2003.

[43] D. Skocaj and A. Leonardis. Weighted and robust incremental method for subspace learning. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2003.

[44] M. Artac, M. Jogan, and A. Leonardis. Incremental PCA for on-line visual learning and recognition. In Proceedings of IEEE International Conference on Pattern Recognition (ICPR), 2002.

[45] A. Levy and M. Lindenbaum. Sequential Karhunen-Loeve basis extraction and its application to images. IEEE Transactions on Image Processing (TIP), 9(8):1371–1374, 2000.

[46] L. Wang, L. Wang, M. Wen, Q. Zhuo, and W. Wang. Background subtraction using incremental subspace learning. In Proceedings of IEEE International Conference on Image Processing (ICIP), 2007.

[47] E. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):1–37, 2011.

[48] X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. IEEE Transactions on Image Processing (TIP), 20(12):3419–3430, 2011.

[49] D. Farcas, C. Marghes, and T. Bouwmans. Background subtraction via incremental maximum margin criterion: A discriminative approach. Machine Vision and Applications, 23(6):1083–1101, 2012.

[50] Y. Li. On incremental and robust subspace learning. Pattern Recognition (PR), 37(7):1509–1518, 2004.

Liang Lin is a full Professor with the School of Advanced Computing, Sun Yat-Sen University (SYSU), China. He received the B.S. and Ph.D. degrees from the Beijing Institute of Technology (BIT), Beijing, China, in 1999 and 2008, respectively. From 2006 to 2007, he was a joint Ph.D. student with the Department of Statistics, University of California, Los Angeles (UCLA). His Ph.D. dissertation received the China National Excellent Ph.D. Thesis Award Nomination in 2010. He was a Post-Doctoral Research Fellow with the Center for Vision, Cognition, Learning, and Art of UCLA. His research focuses on new models, algorithms and systems for intelligent processing and understanding of visual data such as images and videos. He has published more than 50 papers in top tier academic journals and conferences, including Proceedings of the IEEE, T-PAMI, T-IP, T-CSVT, T-MM, Pattern Recognition, CVPR, ICCV, ECCV, ACM MM and NIPS. He has been supported by several promotive programs or funds for his work, such as the “Program for New Century Excellent Talents” of the Ministry of Education (China) in 2012, the “Program of Guangzhou Zhujiang Star of Science and Technology” in 2012, and the Guangdong Natural Science Funds for Distinguished Young Scholars in 2013. He received the Best Paper Runners-Up Award at ACM NPAR 2010 and a Google Faculty Award in 2012.

Yuanlu Xu has recently completed his master's studies at the School of Information Science and Technology, Sun Yat-sen University, and is going to pursue a Ph.D. degree at UCLA. His current advisor is Prof. Liang Lin, with whom he has co-authored a couple of papers on computer vision. Yuanlu received a B.E. degree with honors from the School of Software, Sun Yat-sen University. His research interests are in video surveillance, image matching, and statistical modeling and inference.

Xiaodan Liang received the B.B.A. degree from the School of Software, Sun Yat-Sen University in 2010. She is now a Ph.D. candidate with the School of Information Science and Technology, Sun Yat-Sen University (SYSU), China. She has published several research papers in top tier academic conferences and journals. Her research focuses on structured vision models and multimedia understanding.

Jianhuang Lai received his M.Sc. degree in applied mathematics in 1989 and his Ph.D. in mathematics in 1999 from Sun Yat-Sen University, China. He joined Sun Yat-sen University in 1989 as an Assistant Professor; currently, he is a Professor with the Department of Automation of the School of Information Science and Technology and dean of the School of Information Science and Technology. His current research interests are in the areas of digital image processing, pattern recognition, multimedia communication, wavelets and their applications. He has published over 100 scientific papers in international journals and conferences on image processing and pattern recognition, e.g., IEEE TPAMI, IEEE TNN, IEEE TIP, IEEE TSMC (Part B), Pattern Recognition, ICCV, CVPR and ICDM. Prof. Lai serves as a standing member of the Image and Graphics Association of China and also serves as a standing director of the Image and Graphics Association of Guangdong.