Visual Tracking via Reliable Memories

Shu Wang¹, Shaoting Zhang², Wei Liu³ and Dimitris N. Metaxas¹
¹ CBIM, Rutgers, The State University of New Jersey, Piscataway, NJ, USA
² Department of Computer Science, UNC Charlotte, Charlotte, NC, USA
³ Didi Research, Beijing, China
¹ {sw498, dnm}@cs.rutgers.edu, ² [email protected], ³ [email protected]

arXiv:1602.01887v2 [cs.CV] 17 Feb 2016

Abstract

In this paper, we propose a novel visual tracking framework that intelligently discovers reliable patterns from a wide range of video to resist drift error in long-term tracking tasks. First, we design a Discrete Fourier Transform (DFT) based tracker which is able to exploit a large number of tracked samples while still ensuring real-time performance. Second, we propose a clustering method with temporal constraints to explore and memorize consistent patterns from previous frames, which we call “reliable memories”. By virtue of this method, our tracker can utilize uncontaminated information to alleviate drifting issues. Experimental results show that our tracker performs favorably against other state-of-the-art methods on benchmark datasets. Furthermore, it is markedly effective at handling drift and is able to robustly track challenging long videos of over 4000 frames, while most of the others lose track at early frames.

1 Introduction

Visual tracking is one of the fundamental and challenging problems in computer vision and artificial intelligence. Though much progress has been achieved in recent years [Yilmaz et al., 2006; Wu et al., 2013], there are still unsolved issues due to its complexity on various factors, such as illumination and angle changes, cluttered background, shape deformation and occlusion. Extensive studies on visual tracking employ a tracking-by-detection framework and achieve promising results by extending existing machine learning methods (usually discriminative) in an online learning manner [Avidan, 2004; Avidan, 2007; Grabner et al., 2008; Saffari et al., 2009]. To adaptively model various appearance changes, they deal with a large number of samples¹ at both the detection and updating stages. However, all of them face the same dilemma: while more samples grant better accuracy and adaptiveness, they also come with higher computational cost and risk of drifting. In addition to discriminative methods, [Ross et al., 2008; Mei and Ling, 2011; Wang and Lu, 2014] utilize generative models with a fixed learning rate to account for target appearance changes. The learning rate is essentially a trade-off between adaptiveness and stability. However, even with a very small rate, the influence of former samples on these models still drops exponentially through frames, and drift error may still accumulate. In order to alleviate drift error, [Babenko et al., 2011; Hare et al., 2011; Zhang et al., 2014b] are designed to exploit hidden structured information around the target region. Other methods [Collins and Liu, 2003; Avidan, 2007; Kwon and Lee, 2010] try to avoid drifting by making the current model a combination of the labeled samples in the first frame and the learned samples from the tracking process. However, only a limited number of samples (e.g., those from the first frame) can be regarded as “very confident”, which in turn restricts their robustness in long-term challenging tasks. Recently, several methods [Bolme et al., 2010; Danelljan et al., 2014b; Henriques et al., 2015] employ the Discrete Fourier Transform (DFT) to perform extremely fast detection and achieve high accuracy with the least computational cost. However, as with other generative methods, the memory length of their models is limited by a fixed forgetting rate, and therefore they still suffer from accumulated drift error in long-term tasks.

¹ Here “samples” refers to positive (and negative) target patches for trackers based on generative (or discriminative) models.

A very important observation is that, when the tracked target moves smoothly, e.g., without severe occlusion or out-of-plane rotations, its appearances across frames share high similarity in the feature space (e.g., edge features). Conversely, when it undergoes drastic movements such as in/out-of-plane rotations or occlusions, its appearances may not be that similar to previous ones. Therefore, if we impose a temporal constraint on clustering these samples, such that only temporally adjacent ones can be grouped together, the big clusters with large intra-cluster correlation can indicate the periods when the target experiences small appearance changes. We take human memory as an analogy for these clusters, using reliable memories to represent large clusters that have been consistently perceived for a long time. In this context, earlier memories supported by more samples have a higher probability of being reliable than more recent ones with less support, especially when drift error accumulates across frames. Thus, a tracker may recover from drift error by preferring candidates that share high correlation with earlier memories.

Based on these motivations, we propose a novel tracking framework, which efficiently explores self-correlated appearance clusters across frames, and then preserves reliable memories for long-term robust visual tracking. First, we design a DFT-based visual tracker, which is capable of retrieving good
memories from a vast number of tracked samples for accurate detection, while still ensuring a fast speed for real-time performance. Second, we propose a novel clustering method with temporal constraints to discover distinct and reliable memories from previous frames to help our tracker resist drift error. This method harvests the inherent correlation of the streaming data, and is guaranteed to converge at a fast speed² by being carefully designed upon the Integral Image. To the best of our knowledge, our temporally constrained clustering method is novel to vision streaming data analysis, and its high converging speed and promising performance show great potential in online streaming problems. In particular, it is very effective at discovering clusters (i.e., reliable memories) consisting of uncontaminated sequential samples that were tracked before, and grants our tracker a remarkable ability to resist drift error. Experimental results show that our tracker is considerably competent in handling drift error and performs favorably against other state-of-the-art methods on benchmark datasets. Further, it can robustly track challenging long videos with over 4000 frames, while most of the others lose track at early frames.

[Figure 1 diagram: target location versus time, showing the ideal path, the tracker1 path, the tracker2 path and our tracker path; drift samples are absorbed after t1 and t2, and at t3 our tracker finds a consistent memory among its preserved reliable memories.]

Figure 1: This figure illustrates the basic philosophy of our method. Here we use Snake (video game) as an analogy for learning-rate based visual trackers (tracker1 and tracker2): in order to track the target on the ideal path, they continuously take in new samples and forget old ones due to limited memory length. In contrast, our tracker discovers and preserves multiple temporally constrained clusters as memories, covering a much wider range of the whole sequence. As shown above, tracker1, tracker2 and our tracker depart from the ideal path at times t1 and t2 because of drastic target appearance changes. After that, all three trackers absorb a certain amount of drifted samples. With only a limited length of memory, tracker2 can hardly recover from drift error even if a familiar target appearance shows up at t3. Similarly, tracker1 deviates from the ideal path for a long time and is degraded by drifted samples from time t1 to t3. Even if it happens to be close to the ideal path at t3 by chance, without keeping memory of similar samples from long before, it still drifts from the ideal path with high probability. On the contrary, when a similar target appearance occurs at t3, our tracker corrects the tracking result with consistent and reliable memories, and recovers from drift error.

² Its computational complexity is O(n log n), which costs less than 30 ms for n = 1000 frames.

2 Circulant Structure based Visual Tracking

Recent works [Bolme et al., 2010; Henriques et al., 2012; Danelljan et al., 2014b; Henriques et al., 2015] achieve state-of-the-art tracking accuracy with the least computational cost by exploiting the inherent relationship between the DFT and the circulant structure of dense sampling on the target region. In this section, we briefly introduce these methods, which are highly related to our work.

Suppose $\mathbf{x} \in \mathbb{R}^L$ is a vector of an image patch of size $M \times N$, centered at the target ($L = M \times N$), and $\mathbf{x}_l$ denotes a 2D circular shift of $\mathbf{x}$ by $m \times n$ ($l$ is an index over all $M \times N$ possible shifts, $1 \le l \le L$). $\mathbf{y} \in \mathbb{R}^L$ is a vector of a designed response map of size $M \times N$, with a Gaussian pulse also centered at the target. $\kappa(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ is a positive definite kernel function defined by the mapping $\phi(\mathbf{x}) : \mathbb{R}^L \to \mathbb{R}^D$. We aim to find a linear classifier $f(\mathbf{x}_l) = \omega^T \phi(\mathbf{x}_l) + b$ that minimizes the Regularized Least Squares (RLS) cost function:

$$\min_{\omega} \varepsilon(\omega) = \min_{\omega} \sum_{l} \| y_l - f(\mathbf{x}_l) \|^2 + \lambda \| f \|_{\kappa}^2. \tag{1}$$

The first term is an empirical risk that minimizes the difference between the designed Gaussian response $\mathbf{y}$ and the mapping $\mathbf{x} \to f_L(\mathbf{x}) \in \mathbb{R}^L$, where $f_l(\mathbf{x}) = f(\mathbf{x}_l)$. The second term $\| f \|_{\kappa}$ is a regularization term. It is denoted by $\| f \|_{\kappa}$ since $f$ lies in the reproducing kernel Hilbert space induced by $\kappa$.

By the Representer Theorem [Scholkopf et al., 2001], the cost $\varepsilon(\omega)$ can be minimized by a linear combination of the inputs: $\omega = \sum_{l} \alpha_l \phi(\mathbf{x}_l)$. By defining the kernel matrix $K \in \mathbb{R}^{L \times L}$, $K(l, l') = \kappa(\mathbf{x}_l, \mathbf{x}_{l'})$, a much simpler form of Eq. 1 can be derived:

$$\min_{\alpha} F(\alpha) = \min_{\alpha} (\mathbf{y} - K\alpha)^T (\mathbf{y} - K\alpha) + \lambda \alpha^T K \alpha. \tag{2}$$

This function is convex and differentiable, and has a closed-form minimizer $\alpha = (K + \lambda I)^{-1} \mathbf{y}$. As proved in [Henriques et al., 2012], if the kernel $\kappa$ is unitarily invariant, its kernel matrix $K$ is a circulant matrix, $K = C(\mathbf{k})$, where the vector $\mathbf{k} \in \mathbb{R}^L$ has entries $k_i = \kappa(\mathbf{x}, P^i \mathbf{x})$. $P^i$ is a permutation matrix that cyclically shifts a vector by $i$ elements, and $C(\mathbf{k})$ is the circulant matrix formed by concatenating all $L$ possible cyclic shifts of $\mathbf{k}$. $\alpha$ can then be obtained without inverting $(K + \lambda I)$ by:

$$\alpha = \mathcal{F}^{-1}\left( \frac{\mathcal{F}(\mathbf{y})}{\mathcal{F}(\mathbf{k}) + \lambda \mathbf{1}} \right), \tag{3}$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ are the DFT and its inverse, and $\mathbf{1}$ is an $L \times 1$ vector with all entries equal to 1. The division in Eq. 3 is in the Fourier domain, and is thus performed element-wise. In practice, there is no need to compute $\alpha$ from $A$ (its Fourier-domain counterpart, $A = \mathcal{F}(\alpha)$, as used in Eq. 6), since fast detection can be performed on a given image patch $\mathbf{z}$ by $\hat{\mathbf{y}} = \mathcal{F}^{-1}(\mathcal{F}(\bar{\mathbf{k}}) \odot \mathcal{F}(\alpha))$, where $\bar{\mathbf{k}} \in \mathbb{R}^L$ with $\bar{k}_l = \kappa(\mathbf{z}, \hat{\mathbf{x}}_l)$. $\hat{\mathbf{x}}$ is the learned target appearance. The pulse peak in $\hat{\mathbf{y}}$ shows the target translation in the input image $\mathbf{z}$. A detailed derivation is given in [Gray, 2005; Rifkin et al., 2003; Henriques et al., 2012].
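As a sanity check on Eq. 3, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a 1-D signal instead of an image patch, a linear kernel κ(a, b) = a·b, and hypothetical helper names (`circulant_kernel_matrix`, `train_fft`, `detect_fft`). It compares the DFT solution with a direct solve of (K + λI)α = y, and then locates a known cyclic shift from the peak of the response map.

```python
import numpy as np

def circulant_kernel_matrix(x):
    """Dense K[l, l'] = <x_l, x_l'> over all cyclic shifts x_l (linear kernel, assumption)."""
    shifts = np.stack([np.roll(x, l) for l in range(len(x))])
    return shifts @ shifts.T

def train_fft(x, y, lam):
    """Eq. 3: alpha = F^-1( F(y) / (F(k) + lambda) ), with k_i = <x, P^i x>."""
    k = np.array([x @ np.roll(x, i) for i in range(len(x))])
    return np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k) + lam)).real

def detect_fft(x, alpha, z):
    """Response map F^-1( F(k_bar) * F(alpha) ), with k_bar_l = <z, x_l>."""
    k_bar = np.array([z @ np.roll(x, l) for l in range(len(x))])
    return np.fft.ifft(np.fft.fft(k_bar) * np.fft.fft(alpha)).real

rng = np.random.default_rng(0)
L, lam = 32, 0.1                      # illustrative sizes, not values from the paper
x = rng.standard_normal(L)

# Gaussian pulse centered at shift 0, with cyclic wrap-around
d = np.minimum(np.arange(L), L - np.arange(L))
y = np.exp(-0.5 * (d / 2.0) ** 2)

alpha_fft = train_fft(x, y, lam)
alpha_direct = np.linalg.solve(circulant_kernel_matrix(x) + lam * np.eye(L), y)

shift = 5
response = detect_fft(x, alpha_fft, np.roll(x, shift))
peak = int(np.argmax(response))       # index of the estimated translation
```

The direct solve costs O(L³) while the DFT route costs O(L log L) once k is available, which is the speed advantage the section describes.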

Though the recent methods MOSSE [Bolme et al., 2010], CSK [Henriques et al., 2012] and ACT [Danelljan et al., 2014b] have different configurations of kernel functions and features (e.g., a dot-product kernel $\kappa$ leads to MOSSE, and an RBF kernel leads to the latter two), all of them employ a simple linear combination to learn the target appearance model $\{\hat{\mathbf{x}}^p, \hat{A}^p\}$ at the current frame $p$ by

$$\hat{Q}^p = (1 - \gamma)\hat{Q}^{p-1} + \gamma Q^p, \quad Q \in \{\mathbf{x}, A, A_N, A_D\}. \tag{4}$$

While CSK updates its classifier coefficients $A^p$ by Eq. 4 directly, MOSSE and ACT update the numerator $A_N^p$ and denominator $A_D^p$ of the coefficients $A^p$ separately for stability. The learning rate $\gamma$ is a trade-off parameter between long memory and model adaptiveness. After expanding Eq. 4 we obtain:

$$\hat{Q}^p = \sum_{j=1}^{p} \gamma (1 - \gamma)^{p-j} Q^j, \quad Q \in \{\mathbf{x}, A, A_N, A_D\}. \tag{5}$$

This shows that all three methods have an exponentially decreasing pattern of memory: though the learning rate $\gamma$ is usually small, e.g., $\gamma = 0.1$, the impact of a sample $\{\mathbf{x}^j, A^j\}$ at a certain frame $j$ is negligible after 100 frames ($\gamma(1 - \gamma)^{100} \approx 2.7 \times 10^{-6}$). In other words, these learning-rate based trackers are unable to resort to samples accurately tracked long before to help resist accumulated drift error.
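The exponential forgetting in Eq. 5 can be checked numerically. The sketch below is illustrative only: it uses scalar "appearances" and hypothetical helper names (`memory_weight`, `running_model`), and it assumes the model is initialized with the first sample.

```python
def memory_weight(gamma: float, age: int) -> float:
    """Weight gamma * (1 - gamma) ** age that Eq. 5 gives a sample `age` frames old."""
    return gamma * (1.0 - gamma) ** age

def running_model(samples, gamma):
    """Apply the Eq. 4 update frame by frame (model initialized with the first sample)."""
    model = samples[0]
    for q in samples[1:]:
        model = (1.0 - gamma) * model + gamma * q
    return model

gamma = 0.1
weights = {age: memory_weight(gamma, age) for age in (0, 10, 50, 100)}
# weights[100] is roughly 2.7e-6: a sample from 100 frames ago barely matters.

# A target that looks like 1.0 for 300 frames and then like 0.0 for 100 frames:
# after those 100 frames almost nothing of the long, stable period is left.
model = running_model([1.0] * 300 + [0.0] * 100, gamma)
```

With γ = 0.1 the stable 300-frame appearance contributes only (1 − γ)^100, about 2.7 × 10⁻⁵, to the final model, which is the limitation that motivates keeping explicit memories.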

3 Proposed Method

Aside from the convolution based visual trackers mentioned above, many other trackers [Jepson et al., 2003; Nummiaro et al., 2003; Ross et al., 2008; Babenko et al., 2011] also update their models $Q$ in a similar form, $\hat{Q}^p = (1 - \gamma)\hat{Q}^{p-1} + \gamma Q^p$ with a learning-rate parameter $\gamma \in (0, 1]$, and suffer from the drifting problem.

We observe that smooth movements usually offer consistent appearance cues, which can be modeled as reliable memories to recover the tracker from drifting issues caused by drastic appearance changes (illustrated in Fig. 1). In this section, we introduce our method, which explores, preserves and makes use of reliable memories for long-term robust visual tracking. First, we introduce our novel framework, which is capable of handling a vast number of samples while still ensuring fast detection speed. Then, we elaborate on the details of intelligently arranging past samples into distinct and reliable clusters that grant our tracker resistance to drift error.

3.1 The Circulant Tracker over Vast Samples

Given a new positive sample $\mathbf{x}^p$ at frame $p$, we aim to build an adaptive model $\{\hat{\mathbf{x}}^p, \hat{A}^p\}$ for fast detection in the coming frame $p + 1$ with sample image $\mathbf{z}$ by

$$\hat{\mathbf{y}}^{p+1} = \mathcal{F}^{-1}(\hat{A}^p \odot \mathcal{F}(\bar{\mathbf{k}}^p)), \tag{6}$$

where $\hat{\mathbf{y}}^{p+1}$ is the response map that shows the estimated translation of the target position, and the vector $\bar{\mathbf{k}}^p \in \mathbb{R}^L$ has $l$-th entry $\bar{k}^p_l := \kappa(\mathbf{z}, \hat{\mathbf{x}}^p_l)$. As we advocated, this model $\{\hat{\mathbf{x}}^p, \hat{A}^p\}$ should be built upon vast samples for robustness and adaptiveness. Thus, $\hat{\mathbf{x}}^p$ should have the form:

$$\hat{\mathbf{x}}^p = (1 - \gamma) \sum_{j=1}^{p-1} \beta_j \mathbf{x}^j + \gamma \mathbf{x}^p, \quad \gamma \in (0, 1], \quad \sum_{j=1}^{p-1} \beta_j = 1. \tag{7}$$

As shown, the adaptively learned appearance $\hat{\mathbf{x}}^p$ is a combination of the past $p$ samples, with a proportion $\gamma$ concentrated on $\mathbf{x}^p$. The coefficients $\{\beta_j\}_{j=1}^{p-1}$ represent the correlation between the current estimated appearance $\hat{\mathbf{x}}^p$ and the past appearances $\{\mathbf{x}^j\}_{j=1}^{p-1}$. A proper choice of $\{\beta_j\}_{j=1}^{p-1}$ should make the model: 1) adaptive to new appearance changes, and 2) consistent with past appearances to avoid the risk of drifting. In this paper, we argue that a set $\{\beta_j\}_{j=1}^{p-1}$ with preference to previous reliable memories can provide our tracker with considerable robustness to resist drift error. We discuss how to find these reliable memories in Sec. 3.2, and their connections with $\{\beta_j\}_{j=1}^{p-1}$ are introduced in Sec. 3.3.

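Now, we focus on finding a set of classifier coefficients $\alpha$ that fit both the learned appearance $\hat{\mathbf{x}}^p$ for consistency and the current appearance $\mathbf{x}^p$ for adaptiveness. Based on Eq. 1 and Eq. 2, we derive the following cost function to minimize:

$$F^p(\alpha) = (1 - \gamma)\left[ (\mathbf{y} - \hat{K}^p \alpha)^T (\mathbf{y} - \hat{K}^p \alpha) + \lambda \alpha^T \hat{K}^p \alpha \right] + \gamma \left[ (\mathbf{y} - K^p \alpha)^T (\mathbf{y} - K^p \alpha) + \lambda \alpha^T K^p \alpha \right], \tag{8}$$

where the kernel matrix $\hat{K}^p = C(\hat{\mathbf{k}}^p)$, with vector entries $\hat{k}^p_l = \kappa(\hat{\mathbf{x}}^p, \hat{\mathbf{x}}^p_l)$ (similarly for $K^p$ and $\mathbf{k}^p$). $\gamma$ is a balance factor between memory consistency and model adaptiveness. By setting the derivative $\nabla_{\alpha} F^p = 0$, the exact solution $\alpha^p$ satisfies a complicated condition:

$$\left[ (1 - \gamma)\hat{K}^p(\hat{K}^p + \lambda I) + \gamma K^p (K^p + \lambda I) \right] \alpha^p = \left[ (1 - \gamma)\hat{K}^p + \gamma K^p \right] \mathbf{y}. \tag{9}$$

We observe that the adaptively learned appearance $\hat{\mathbf{x}}^p$ should be very close to the current one $\mathbf{x}^p$, since it is a linear combination of close appearances in the past $\{\mathbf{x}^j\}_{j=1}^{p-1}$ and the current appearance $\mathbf{x}^p$, as shown in Eq. 7. Notice that both kernel matrices $\hat{K}^p$ and $K^p$ (and their linear combinations with $\lambda I$) are positive semidefinite. By relaxing Eq. 9 with $\hat{K}^p \simeq (1 - \gamma)\hat{K}^p + \gamma K^p \simeq K^p$, we obtain an approximate minimizer $\alpha^p$ in a very simple form:

$$\alpha^p \simeq \left[ (1 - \gamma)\hat{K}^p + \gamma K^p + \lambda I \right]^{-1} \mathbf{y} = \left[ C\big( (1 - \gamma)\hat{\mathbf{k}}^p + \gamma \mathbf{k}^p + \lambda \delta \big) \right]^{-1} \mathbf{y} = \mathcal{F}^{-1}\left( \frac{\mathcal{F}(\mathbf{y})}{(1 - \gamma)\mathcal{F}(\hat{\mathbf{k}}^p) + \gamma \mathcal{F}(\mathbf{k}^p) + \lambda \mathbf{1}} \right). \tag{10}$$

$\delta$ is an $L$-dimensional vector of the form $\delta = [1, 0, ..., 0]^T$, with the property that $C(\delta) = I$ and $\mathcal{F}(\delta) = \mathbf{1}$ ($\mathbf{1}$ is an $L$-dimensional vector of ones). Note that inside $\mathcal{F}^{-1}(\cdot)$, the division is performed element-wise.

As long as we find a proper set of coefficients $\{\beta_j\}_{j=1}^{p-1}$, we can build our detection model $\{\hat{\mathbf{x}}^p, \hat{A}^p\}$ by Eq. 7 and Eq. 10. In the next frame $p + 1$, fast detection can be performed by Eq. 6 with this learned model.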
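The pipeline of Eq. 7, Eq. 10 and Eq. 6 can be sketched in a few lines of NumPy. This is a hedged illustration under stated assumptions, not the released tracker: signals are 1-D, the kernel is linear, the β weights are picked by hand, and all function names (`autocorr`, `learned_appearance`, `train_blended`, `detect`) are hypothetical.

```python
import numpy as np

def autocorr(x):
    """k_l = <x, x_l> over all cyclic shifts x_l (linear kernel), via the DFT."""
    X = np.fft.fft(x)
    return np.fft.ifft(X * np.conj(X)).real

def learned_appearance(past, betas, current, gamma):
    """Eq. 7: (1 - gamma) * sum_j beta_j x^j + gamma * x^p, assuming sum(betas) == 1."""
    return (1.0 - gamma) * np.tensordot(betas, past, axes=1) + gamma * current

def train_blended(x_hat, x_cur, y, gamma, lam):
    """Eq. 10 kept in the Fourier domain; returns A = F(alpha)."""
    denom = ((1.0 - gamma) * np.fft.fft(autocorr(x_hat))
             + gamma * np.fft.fft(autocorr(x_cur)) + lam)
    return np.fft.fft(y) / denom

def detect(A, x_hat, z):
    """Eq. 6: response = F^-1( A * F(k_bar) ), with k_bar_l = <z, x_hat_l>."""
    k_bar = np.fft.ifft(np.fft.fft(z) * np.conj(np.fft.fft(x_hat))).real
    return np.fft.ifft(A * np.fft.fft(k_bar)).real

rng = np.random.default_rng(1)
L, gamma, lam = 64, 0.15, 0.1          # gamma as in Sec. 4; L and lam are illustrative
base = rng.standard_normal(L)
past = base + 0.01 * rng.standard_normal((3, L))   # three slightly noisy past samples
current = base + 0.01 * rng.standard_normal(L)
betas = np.array([0.2, 0.3, 0.5])                  # hand-picked weights, sum to 1

d = np.minimum(np.arange(L), L - np.arange(L))
y = np.exp(-0.5 * (d / 2.0) ** 2)                  # Gaussian pulse at shift 0

x_hat = learned_appearance(past, betas, current, gamma)
A = train_blended(x_hat, current, y, gamma, lam)
response = detect(A, x_hat, np.roll(current, 7))
peak = int(np.argmax(response))                    # estimated translation of the target
```

When the learned and current appearances coincide, the denominator collapses to F(k) + λ, so the update reduces to Eq. 3; the blend only matters when the two kernels differ.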

3.2 Clustering with Temporal Constraints

In this subsection, we introduce our temporally constrained clustering, which learns distinct and reliable memories from the incoming samples very quickly. Together with the ranked memories (Sec. 3.3), our tracker is robust to inaccurate tracking results, and is able to recover from drift error.

Suppose a set of positive samples is given at frame $p$: $S = \{\mathbf{x}_i\}_{i=1}^{p}$, and we would like to divide them into $H$ subsets $\{s_h\}_{h=1}^{H}$ with an indexing vector set $M = \{\mathbf{m}_1, ..., \mathbf{m}_H\}$, $\mathbf{m}_h \in \{0, 1\}^p$, such that $s_h := \{\mathbf{x}_i : m_h^i = 1, \forall i = 1, ..., p\}$. Our objectives are as follows: 1) samples in each subset $s_h$ are highly correlated; 2) samples from different subsets have relatively large appearance differences, so a linear combination of them is vague or even ambiguous in describing the tracked target (e.g., samples from different viewpoints of the target).


[Figure 2 panels. Left: "Distance Matrix and Clustering Result" (axis ticks 200 to 1200 on both axes). Right: "Six temporally constrained clusters with distinct appearances": Memory 01 (frames 0001-0416), Memory 02 (frames 0417-0480), Memory 05 (frames 0529-0592), Memory 08 (frames 0625-0768), Memory 11 (frames 0913-0928), Memory 23 (frames 1281-1345).]

Figure 2: Left: the distance matrix D as described in Alg. 1. Right: six representative clusters with corresponding colored bounding boxes are shown for intuitive understanding. The image patch in each big bounding box is the average appearance of a certain cluster (memory), while the small patches are samples chosen evenly in the temporal domain from each cluster.

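Algorithm 1 Temporal Constrained Clustering Algorithm
Input: Integral image $J$ of distance matrix $D \in \mathbb{R}^{p \times p}$, s.t. $D_{ij} = \|\varphi(\mathbf{x}_i) - \varphi(\mathbf{x}_j)\|^2$, $\forall i, j = 1, ..., p$;
    $M = \{\mathbf{m}_i\}_{i=1}^{p}$, $\mathbf{m}_i = P_i \delta$, $\forall i = 1, ..., p$;
    $P_i$ is a shifting matrix and $\delta = [1, 0, ..., 0]^T$;
    Stopping factor $\rho$, and $N = |M| + 1$.
Output: $\hat{M} = \{\mathbf{m}_i\}_{i=1}^{H}$.
while ($|M| < N$) do
    $N = |M|$;
    for $h = 1 : 2 : |M|$ do
        Evaluate $\tau(s_h, s_{h+1}) = C(s_h \cap s_{h+1}) - (C(s_h) + C(s_{h+1}))$ using $J$.
        if $\tau(s_h, s_{h+1}) \le \rho\,(C(s_h) + C(s_{h+1}))$ then
            $\mathbf{m}_h = \mathbf{m}_h + \mathbf{m}_{h+1}$, remove $\mathbf{m}_{h+1}$ from $M$;
        end if
    end for
end while
$\hat{M} = M$.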

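Thus, it can be modeled as a very general clustering problem:

$$\min_{M} \sum_{h} C(s_h) + \eta\, r(|M|), \quad \text{s.t.}\ \langle \mathbf{m}_i, \mathbf{m}_j \rangle = 0,\ \forall \mathbf{m}_i, \mathbf{m}_j \in M,\ i \ne j; \quad \sum_{h} \mathbf{m}_h = \mathbf{1}_{p \times 1}. \tag{11}$$

The function $C(s_h)$ measures the average sample distance in feature space $\varphi(\cdot)$ within subset $s_h$, in the form $C(s_h) = (\mathbf{1}_{p \times 1}^T \mathbf{m}_h)^{-1} \sum_{\forall \mathbf{x}_i, \mathbf{x}_j \in s_h,\, i < j} \|\varphi(\mathbf{x}_i) - \varphi(\mathbf{x}_j)\|^2$. The regularizer $r(|M|)$ is a function of the number of subsets $|M|$, and $\eta$ is a balance factor. This is a discrete optimization problem and is known to be NP-hard. By fixing the number of subsets $|M|$ to a certain constant $k$, k-means clustering can converge to a local optimum.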

However, during the process of visual tracking, we do not know the sufficient number of clusters. While too many clusters cause over-fitting, too few clusters may lead to ambiguity. More importantly, as long as we allow arbitrary combinations of samples during clustering, any cluster risks taking in samples contaminated by drift error, or even wrongly labeled samples, which in turn degrade the performance of models built upon them.

One important observation is that target appearances close to each other in the temporal domain may form a very distinctive and consistent pattern, i.e., a reliable memory. For example, if a well-tracked target moves around without big rotations or large changes in angle for a period of time, its edge-based features would have much higher similarity than features under different angles. In order to discover these memories, we add temporal constraints to Eq. 11:

$$\mathbf{m}_h \in T,\ \forall h = 1, ..., H; \quad T := \{\mathbf{t} \in \{0, 1\}^p : \text{all entries } t_i = 1 \text{ are contiguous}\}. \tag{12}$$

Then Eq. 11 with Eq. 12 becomes the problem of segmenting $S$ into subsets $\{s_h\}_{h=1}^{H}$ such that each subset only contains temporally contiguous samples $s_h = \{\mathbf{x}_i\}_{i=u}^{v}$ ($u$, $v$ are certain frame numbers).

Still, the constraint of this new problem is discrete and the global optimum can hardly be reached. We carefully designed a greedy algorithm, shown in Alg. 1, which starts from the trivial state of $p$ subsets. It tries to reduce the regularizer $r(|M|)$ in the objective function of Eq. 11 by combining temporally adjacent subsets $s_h$ and $s_{h+1}$, while penalizing the increase of the average sample distance $\tau(s_h, s_{h+1}) := C(s_h \cap s_{h+1}) - (C(s_h) + C(s_{h+1}))$.

With an intelligent use of the Integral Image [Viola and Jones, 2001], the evaluation operation in each combining step of Alg. 1 takes only $O(1)$ running time with the integral image $J$, and each iteration takes a linear $O(p)$ number of operations. The whole algorithm proceeds in a bottom-up binary tree structure, runs in $O(p \log p)$ in the worst case, and takes less than 30 ms on a desktop for over 1000 samples. Designed experiments will show that the proposed algorithm is very effective at finding distinctive appearance clusters (reliable memories) for our tracker to learn.
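The greedy merging and the O(1) cost lookup can be sketched as follows. This is one possible reading of Alg. 1, not the authors' code, and it makes three loudly stated assumptions: the cost of a merged pair is taken over the union of the two adjacent segments; `rho` is used as an absolute threshold on τ (a purely relative threshold could never merge two single-frame segments, whose own costs are zero); and when a pair is not merged the scan advances by one segment instead of two. All function names are hypothetical.

```python
import numpy as np

def integral_image(D):
    """Summed-area table J with a zero first row and column: J[a, b] = D[:a, :b].sum()."""
    J = np.zeros((D.shape[0] + 1, D.shape[1] + 1))
    J[1:, 1:] = D.cumsum(axis=0).cumsum(axis=1)
    return J

def segment_cost(J, u, v):
    """C(s) for frames u..v (inclusive): sum of pairwise distances divided by |s|, in O(1)."""
    block = J[v + 1, v + 1] - J[u, v + 1] - J[v + 1, u] + J[u, u]
    n = v - u + 1
    return block / (2.0 * n)   # the block counts every pair i < j twice; the diagonal is 0

def temporal_clustering(D, rho):
    """Greedily merge temporally adjacent segments while tau <= rho (assumed criterion)."""
    J = integral_image(D)
    segments = [(i, i) for i in range(D.shape[0])]   # trivial start: one frame per segment
    merged_any = True
    while merged_any:
        merged_any = False
        out, h = [], 0
        while h < len(segments):
            if h + 1 < len(segments):
                left, right = segments[h], segments[h + 1]
                tau = (segment_cost(J, left[0], right[1])
                       - segment_cost(J, *left) - segment_cost(J, *right))
                if tau <= rho:
                    out.append((left[0], right[1]))
                    h += 2
                    merged_any = True
                    continue
            out.append(segments[h])
            h += 1
        segments = out
    return segments

# Toy 1-D "features": 8 frames of one appearance followed by 8 frames of another
feats = np.array([0.0] * 8 + [10.0] * 8)
D = (feats[:, None] - feats[None, :]) ** 2
segments = temporal_clustering(D, rho=1.0)   # list of (first_frame, last_frame) pairs
```

On this toy input the two appearance regimes end up as two segments, because merging across the boundary would raise the average pairwise distance far above the threshold.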

3.3 The Workflow of Our Tracking Framework

Two feature pools are employed in our framework, one for incoming positive samples across frames, and the second (denoted by $U$) for the learned memories. Every memory $u \in U$ contains a certain number of samples $\{\mathbf{x}^u_j\}_{j=1}^{N^u}$ and a confidence $c^u$:

$$c^u = e^{-(\sigma_1 B^u - \sigma_2 N^u)}, \tag{13}$$

where $N^u$ is the number of samples in memory $u$ and $B^u$ is the beginning frame number of memory $u$. This memory confidence is consistent with our hypothesis that earlier memories with more samples are more stable and less likely to be affected by accumulated drift error. For each frame, we first detect the object using Eq. 6 to estimate the translation of the target, and then utilize this new sample to update our appearance model $\{\hat{\mathbf{x}}^p, \hat{A}^p\}$ by Eq. 7 and Eq. 10.
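To make the ranking behind Eq. 13 concrete, here is a small sketch. The values of σ1 and σ2 are not given in the text, so the ones below are purely hypothetical, as are the `Memory` class and the `prune` helper; the frame ranges are those listed for three memories in Fig. 2.

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    name: str
    begin_frame: int   # B^u, the first frame of the memory
    num_samples: int   # N^u, the number of samples it contains

def confidence(mem: Memory, sigma1: float, sigma2: float) -> float:
    """Eq. 13: c^u = exp(-(sigma1 * B^u - sigma2 * N^u))."""
    return math.exp(-(sigma1 * mem.begin_frame - sigma2 * mem.num_samples))

def prune(memories, max_memories, sigma1, sigma2):
    """Keep the `max_memories` most confident memories, most confident first."""
    ranked = sorted(memories, key=lambda m: confidence(m, sigma1, sigma2), reverse=True)
    return ranked[:max_memories]

sigma1 = sigma2 = 0.01   # hypothetical values, chosen only for illustration
pool = [
    Memory("memory 01", begin_frame=1, num_samples=416),
    Memory("memory 08", begin_frame=625, num_samples=144),
    Memory("memory 11", begin_frame=913, num_samples=16),
]
kept = prune(pool, max_memories=2, sigma1=sigma1, sigma2=sigma2)
```

Early, well-supported memories rank highest, while a late memory with few samples is the first to be dropped, matching the hypothesis stated above.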

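The correlation coefficients $\{\beta_j\}_{j=1}^{p-1}$ are then calculated by:

$$\beta_j = \Theta^{-1} \exp\Big\{ -\sum_{j \in u} \|\varphi(\hat{\mathbf{x}}^p) - \varphi(\mathbf{x}^j)\|^2 \Big\}, \tag{14}$$

where the scalar $\Theta$ is a normalization term that ensures $\sum_{j=1}^{p-1} \beta_j = 1$, and $u$ is the memory most similar to the current learned appearance $\hat{\mathbf{x}}^p$ in feature space $\varphi(\cdot)$.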
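Eq. 14 leaves some details open, so the sketch below shows only one possible reading, under explicit assumptions: the most similar memory is the one whose mean feature is closest to the learned appearance, samples inside that memory get weights proportional to exp(−‖φ(x̂) − φ(x_j)‖²), and samples outside it get zero weight. The function name `memory_weights` and the toy features are hypothetical.

```python
import numpy as np

def memory_weights(x_hat_feat, sample_feats, memory_ids):
    """Return (beta, best): normalized weights over past samples and the chosen memory id."""
    sample_feats = np.asarray(sample_feats, dtype=float)
    memory_ids = np.asarray(memory_ids)
    ids = np.unique(memory_ids)
    # Assumption: similarity to a memory is measured against its mean feature
    dists = [np.sum((sample_feats[memory_ids == u].mean(axis=0) - x_hat_feat) ** 2)
             for u in ids]
    best = ids[int(np.argmin(dists))]
    # Assumption: weights decay with squared feature distance, inside the chosen memory only
    d2 = np.sum((sample_feats - x_hat_feat) ** 2, axis=1)
    beta = np.where(memory_ids == best, np.exp(-d2), 0.0)
    return beta / beta.sum(), best   # dividing by the sum plays the role of Theta

feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]]   # toy 2-D features
ids = [0, 0, 1, 1]                                         # memory label per sample
beta, best = memory_weights(np.array([0.05, 0.0]), feats, ids)
```

The resulting `beta` can be passed as the coefficients of Eq. 7, so that the learned appearance leans on the memory that currently looks most familiar.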

To update memories, we use Alg. 1 to cluster the positive samples in the first feature pool into ‘memories’, and import all except the last one into $U$. Note that when $|U|$ reaches its threshold, the memories with the lowest confidence are abandoned immediately.

4 Experiments

Our framework is implemented in Matlab, with running speed ranging from 12 fps to 20 fps on a desktop with an Intel Xeon(R) 3.5GHz CPU, a Tesla K40c video card and 32GB RAM. The adaptiveness ratio $\gamma$ is empirically set to 0.15 in all experiments. The stopping factor $\rho$ is decided adaptively as 1.2 times the average covariance of the samples in the first 40 frames of each video. HOG [Dalal and Triggs, 2005] is chosen as the feature $\varphi(\cdot)$. The maximum number of memories $|U|$ is set to 10 and $\max(N^u)$ is set to 100.

4.1 Evaluation of Temporally Constrained Clustering

In order to validate our assumption that temporally constrained clustering on sequentially tracked samples forms reliable and recognizable patterns, we run Alg. 1 offline on the positive samples from our tracking results. Note that our algorithm gives exactly the same result in the online and offline settings, since previously clustered samples have no effect on clustering the unfixed samples afterwards. Due to space limitations, here we present illustrative results from the sequences Sylvester and Trellis, in Fig. 2 and Fig. 3.

Fig. 2 shows our detailed results on the sequence Sylvester, in which the target experiences illumination variation and in-plane and out-of-plane rotation over a long term of 1345 frames. The left part shows the distance matrix $D$ as described in Alg. 1, where $D_{ij} = \|\varphi(\mathbf{x}_i) - \varphi(\mathbf{x}_j)\|^2$, $\forall i, j = 1, ..., p$. A pixel $D_{ij}$ in dark blue (light yellow) implies a small (large) distance between samples $\mathbf{x}_i$ and $\mathbf{x}_j$ in feature space $\varphi(\cdot)$. Differently colored diagonal bounding boxes represent different temporally constrained clusters. The right part shows six representative clusters, corresponding to the colored bounding boxes on the matrix. Memory #1 and memory #8 are two large clusters containing a large number of samples with highly correlated appearance (blue color). Memory #11 represents a cluster with only 16 samples. Its late emergence and limited number of samples result in a very low confidence $c^u$, and thus it is not likely to replace any existing reliable memory.

Fig. 3 shows mid-term results on the sequence Trellis, compared with two closely related trackers, MOSSE and ACT, along with their learned appearances at these frames. Several other state-of-the-art methods are also shown for comparison. Among them, MOSSE and ACT keep only one memory, which is generated from former appearances and gradually adapts to the current one. Though they are very robust to illumination changes, drift error still accumulates across frames. They can hardly recover from drift due to the conciseness of their model. Our method, with learned reliable memories, is very robust to appearance changes, and can recover from drift when it observes appearances familiar to its memories.

4.2 Boosting by Deep CNN

Our tracker’s inherent requirement to efficiently search for familiar patterns (memories) at the global scale of the frame overlaps with the object detection task [Girshick, 2015; He et al., 2015]. Recently, with the fast development of convolutional neural networks (CNN) [Krizhevsky et al., 2012; Zeiler and Fergus, 2014], Faster-RCNN [Ren et al., 2015] achieves ≥ 5 fps detection speed by using shared convolutional layers for both object proposal and detection. To equip our tracker with a global view for its reliable memories, we fine-tune the FC layers of a Faster-RCNN detector (ZF-Net) once we have learned sufficient memories in a video, which helps our tracker resolve local minimum issues caused by its limited effective detection range. Though only supplied with coarse detections that carry a risk of false alarms, our tracker can start from a local region close to the target and then ensure accurate and smooth tracking results. Note that we tune the CNN only once, with around 150 seconds of running time on one Tesla K40c for 3,000 iterations. When the tracking task is long, e.g., more than 3,000 frames, the average fps is larger than 15, which is certainly worthwhile given the significant improvement in robustness. In the following stage, we perform CNN detection every 5 frames, each taking less than 0.1 s.

4.3 Quantitative Evaluation

We first evaluate our method on 50 challenging sequences from OTB-2013 [Wu et al., 2013] against 14 state-of-the-art methods: ACT [Nummiaro et al., 2003], ASLA [Jia et al., 2012], CSK [Henriques et al., 2012], CXT [Dinh et al., 2011], DSST [Danelljan et al., 2014a], KCF [Henriques et al., 2015], LOT [Oron et al., 2012], MEEM [Zhang et al., 2014a], MOSSE [Bolme et al., 2010], SCM [Zhong et al., 2012], Struck [Hare et al., 2011], TGPR [Gao et al., 2014], TLD [Kalal et al., 2012] and VTD [Kwon and Lee, 2010]. We employ the released code from public resources (e.g., OTB-2013) or the version released by the authors, and all parameters are fixed for each tracker during testing. Fig. 4 shows the success plots on the whole dataset with the one pass evaluation (OPE) standard. Our tracker, denoted RMT (Reliable Memory Tracker), obtains the best performance, while MEEM, TGPR, KCF and DSST also provide competitive results. TGPR’s idea of building one tracker on auxiliary (very early) samples and MEEM’s idea of using the tracker’s snapshot can be interpreted as making use of early formed reliable memory patterns, which is closely related to our method. DSST designs a very concise pyramid representation for object scale estimation and employs robust dense HOG features to achieve high accuracy in estimating target


[Figure 3 panels: frames 070, 390, 425, 460, 480, 515 and 530; top row: appearances learned by MOSSE and ACT at each frame; bottom row: Memory 01 to Memory 14.]

Figure 3: Comparison on the Trellis video. Blue, magenta, red and white represent results from MOSSE, ACT, Struck and VTD. Our results are represented as bold colored boxes with a dot on the top-right corner, and each color indicates the active memory at that frame, shown in the bottom row. Learned appearances from MOSSE and ACT at each frame are also shown in the top row. As illustrated, the target experiences out-of-plane rotation in frame 390, which brings about drift error (memory 07). When he turns his head back to the front (frames 425 and 460), our method uses reliable memories 04 and 06 respectively to recover from drifting. Note that these two memories were built before drift error accumulated during the out-of-plane rotation period.

[Figure 4 plots: five success plots of OPE; x-axis: overlap threshold (0 to 1); y-axis: success rate. Legend scores per panel:]

Success plots of OPE: RMT [0.596], MEEM [0.561], DSST [0.555], TGPR [0.529], KCF [0.516], SCM [0.499], Struck [0.474], ACT [0.457], TLD [0.437], ASLA [0.434]

Success plots of OPE - occlusion (29): RMT [0.610], MEEM [0.543], DSST [0.536], KCF [0.518], TGPR [0.494], SCM [0.487], ACT [0.444], Struck [0.413], VTD [0.403], TLD [0.402]

Success plots of OPE - out-of-plane rotation (39): RMT [0.593], MEEM [0.554], DSST [0.536], TGPR [0.507], KCF [0.499], SCM [0.470], ACT [0.456], VTD [0.434], Struck [0.432], ASLA [0.422]

Success plots of OPE - out of view (6): RMT [0.614], MEEM [0.597], KCF [0.550], LOT [0.467], Struck [0.459], DSST [0.459], TLD [0.457], VTD [0.446], TGPR [0.431], CXT [0.427]

Success plots of OPE - scale variation (28): RMT [0.567], DSST [0.538], SCM [0.518], MEEM [0.510], ASLA [0.452], TGPR [0.443], KCF [0.427], Struck [0.425], TLD [0.421], VTD [0.405]

Figure 4: Tracking result comparison on 50 sequences from the OTB-2013 dataset. Our tracker is represented by RMT and achieved the top performance on success plots evaluation standard. MEEM, TGPR, DSST and KCF also have close performance to our tracker. Only top-10 out of 14 tracker results are shown for clearness.

motion. Our tracker outperforms the others in most challenging scenarios, e.g., occlusion, out-of-plane rotation, out of view and fast motion, as illustrated by Fig. 4. The main reason is that our tracker possesses a number of very reliable memories and a global vision that help it regain focus on the target after drastic appearance changes.
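For readers unfamiliar with success plots, the sketch below shows one common way to compute them. It assumes boxes are given as (x, y, width, height) rows with positive area, and that the bracketed legend scores follow the usual OTB convention of the area under the success curve, approximated here by the mean success rate over evenly spaced thresholds. The function names and the choice of 21 thresholds are illustrative assumptions, not taken from the benchmark code.

```python
import numpy as np

def overlap_ratio(boxes_a, boxes_b):
    """Intersection over union for paired boxes given as (x, y, width, height)."""
    a = np.asarray(boxes_a, dtype=float)
    b = np.asarray(boxes_b, dtype=float)
    left = np.maximum(a[:, 0], b[:, 0])
    top = np.maximum(a[:, 1], b[:, 1])
    right = np.minimum(a[:, 0] + a[:, 2], b[:, 0] + b[:, 2])
    bottom = np.minimum(a[:, 1] + a[:, 3], b[:, 1] + b[:, 3])
    # Negative extents mean the boxes do not intersect, so clip them to zero.
    inter = np.clip(right - left, 0, None) * np.clip(bottom - top, 0, None)
    union = a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter
    return inter / union

def success_curve(predicted, ground_truth, num_thresholds=21):
    """Success rate at each overlap threshold, plus the mean rate as an AUC estimate."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    iou = overlap_ratio(predicted, ground_truth)
    rates = np.array([(iou > t).mean() for t in thresholds])
    return thresholds, rates, rates.mean()
```

A tracker whose boxes overlap the ground truth well in most frames keeps a high success rate as the threshold increases, which is what raises its score in the legend.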

Sequence     Fr. No.   MOSSE   KCF     ACT     DSST    TLD    MEEM   RMT
Motocross    2,035     295.9   181.5   182.5   67.5    44.7   33.4   21.5
Volkswagon   4,000     60.6    114.1   41.3    122.7   15.9   51.1   12.3
Carchase     4,000     125.0   129.4   98.0    132.6   34.4   38.1   34.1
Panda        3,000     64.8    83.3    64.5    71.4    27.1   97.9   23.9
Overall      13,035    118.5   122.3   86.1    105.3   28.7   55.1   23.1

Table 1: Tracking result comparison based on average errors of center location in pixels (the smaller the better) on four long-term videos over 13,000 frames. Average performances are weighted by the frame number for fairness.
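The frame-weighted average in the last row of Table 1 can be checked with a short sketch. The per-sequence errors below are copied from the table for three of the trackers; the helper names are illustrative, and `center_error` assumes centers are given as (x, y) rows, which is an assumption about the data layout rather than something stated in the paper.

```python
import numpy as np

# Frame counts for Motocross, Volkswagon, Carchase and Panda (Table 1).
FRAMES = np.array([2035, 4000, 4000, 3000])

# Average center location errors in pixels, copied from Table 1.
ERRORS = {
    "TLD": [44.7, 15.9, 34.4, 27.1],
    "MEEM": [33.4, 51.1, 38.1, 97.9],
    "RMT": [21.5, 12.3, 34.1, 23.9],
}

def center_error(predicted_centers, true_centers):
    """Mean Euclidean distance in pixels between predicted and true centers."""
    diff = np.asarray(predicted_centers, dtype=float) - np.asarray(true_centers, dtype=float)
    return np.linalg.norm(diff, axis=1).mean()

def weighted_overall(per_sequence_errors, frames=FRAMES):
    """Average of per-sequence errors, weighted by the number of frames."""
    return np.average(per_sequence_errors, weights=frames)
```

Weighting by frame count means a 4,000-frame sequence contributes roughly twice as much to the overall error as the 2,035-frame Motocross sequence.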

To explore the robustness of our tracker and validate its resistance to drift error on long-term challenging tasks, we run it on four long sequences from [Kalal et al., 2012], over 13,000 frames in total. We also evaluate the convolution filter based methods that are highly related to ours: MOSSE [Bolme et al., 2010], KCF [Henriques et al., 2015], ACT [Nummiaro et al., 2003] and DSST [Danelljan et al., 2014a], together with MEEM [Zhang et al., 2014a] and a detector-based method, TLD [Kalal et al., 2012] (shown in Tab. 1). To make a fair comparison, we re-labeled the initial frame to ensure that no tracker loses focus at the beginning. While MOSSE often loses track at very early frames, KCF, ACT and DSST are able to track the target stably for hundreds of frames, but usually cannot maintain their focus after 600 frames. MEEM performs favorably on the video Motocross for over 1700 frames with its impressive robustness, but it cannot adapt to scale changes and still produces inaccurate results. Our tracker and TLD outperform the other five trackers on all videos, since both have a global vision to search for the target. However, based on an online random forest model, TLD gradually takes in false positive samples, which finally leads to false detections and inaccurate tracking results. In contrast, guided by the CNN detector trained with our reliable memories, our tracker is affected by only a very limited number of false detections. It robustly tracks the target across all frames and gives accurate target location and scale until the last frame for all four videos. A video clip with more detailed illustration and qualitative comparison can be found here.

5 Conclusion

In this paper, we propose a novel tracking framework that explores temporally correlated appearance clusters across tracked samples and then preserves reliable memories for robust visual tracking. A novel clustering method with temporal constraints is carefully designed to help our tracker retrieve good memories from a vast number of samples for accurate detection while still ensuring real-time performance. Experiments show that our tracker performs favorably against other state-of-the-art methods, with an outstanding ability to recover from drift error in long-term tracking tasks.


References

[Avidan, 2004] Shai Avidan. Support vector tracking. PAMI, 26(8):1064–1072, 2004.

[Avidan, 2007] S. Avidan. Ensemble Tracking. PAMI, 29(2):261, 2007.

[Babenko et al., 2011] Boris Babenko, Ming-Hsuan Yang, and Serge Belongie. Robust object tracking with online multiple instance learning. PAMI, 33(8):1619–1632, 2011.

[Bolme et al., 2010] David S. Bolme, J. Ross Beveridge, Bruce A. Draper, and Yui Man Lui. Visual object tracking using adaptive correlation filters. In CVPR, pages 2544–2550, 2010.

[Collins and Liu, 2003] Robert Collins and Yanxi Liu. On-line selection of discriminative tracking features. In ICCV, pages 346–352, 2003.

[Dalal and Triggs, 2005] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893, 2005.

[Danelljan et al., 2014a] Martin Danelljan, Gustav Hager, Fahad Khan, and Michael Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, Nottingham, September 1-5, 2014. BMVA Press, 2014.

[Danelljan et al., 2014b] Martin Danelljan, Fahad Shahbaz Khan, Michael Felsberg, and Joost van de Weijer. Adaptive color attributes for real-time visual tracking. In CVPR, pages 1090–1097, 2014.

[Dinh et al., 2011] Thang Ba Dinh, Nam Vo, and Gerard Medioni. Context tracker: Exploring supporters and distracters in unconstrained environments. In CVPR, pages 1177–1184. IEEE, 2011.

[Gao et al., 2014] Jin Gao, Haibin Ling, Weiming Hu, and Junliang Xing. Transfer learning based visual tracking with gaussian processes regression. In ECCV, pages 188–203. Springer, 2014.

[Girshick, 2015] Ross Girshick. Fast R-CNN. In ICCV, 2015.

[Grabner et al., 2008] Helmut Grabner, Christian Leistner, and Horst Bischof. Semi-supervised on-line boosting for robust tracking. In ECCV, pages 234–247. Springer, 2008.

[Gray, 2005] Robert M Gray. Toeplitz and circulant matrices: A review. Communications and Information Theory, 2(3):155–239, 2005.

[Hare et al., 2011] Sam Hare, Amir Saffari, and Philip HS Torr. Struck: Structured output tracking with kernels. In ICCV, pages 263–270. IEEE, 2011.

[He et al., 2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 37(9):1904–1916, 2015.

[Henriques et al., 2012] Joao F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, pages 702–715, 2012.

[Henriques et al., 2015] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. PAMI, 2015.

[Jepson et al., 2003] Allan D Jepson, David J Fleet, and Thomas F El-Maraghi. Robust online appearance models for visual tracking. PAMI, 25(10):1296–1311, 2003.

[Jia et al., 2012] Xu Jia, Huchuan Lu, and Ming-Hsuan Yang. Visual tracking via adaptive structural local sparse appearance model. In CVPR, pages 1822–1829. IEEE, 2012.

[Kalal et al., 2012] Zdenek Kalal, Krystian Mikolajczyk, and Jiri Matas. Tracking-learning-detection. PAMI, 34(7):1409–1422, 2012.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105. Curran Associates, Inc., 2012.

[Kwon and Lee, 2010] Junseok Kwon and Kyoung Mu Lee. Visual tracking decomposition. In CVPR, pages 1269–1276, 2010.

[Mei and Ling, 2011] Xue Mei and Haibin Ling. Robust visual tracking and vehicle classification via sparse representation. TPAMI, 33(11):2259–2272, 2011.

[Nummiaro et al., 2003] Katja Nummiaro, Esther Koller-Meier, and Luc Van Gool. An adaptive color-based particle filter. IVC, 21(1):99–110, 2003.

[Oron et al., 2012] Shaul Oron, Aharon Bar-Hillel, Dan Levi, and Shai Avidan. Locally orderless tracking. In CVPR, pages 1940–1947. IEEE, 2012.

[Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

[Rifkin et al., 2003] Ryan Rifkin, Gene Yeo, and Tomaso Poggio. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003.

[Ross et al., 2008] David A Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. Incremental learning for robust visual tracking. IJCV, 77(1-3):125–141, 2008.

[Saffari et al., 2009] Amir Saffari, Christian Leistner, Jakob Santner, Martin Godec, and Horst Bischof. On-line random forests. In ICCVW, pages 1393–1400. IEEE, 2009.

[Scholkopf et al., 2001] Bernhard Scholkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In COLT, pages 416–426, 2001.

[Viola and Jones, 2001] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, pages I–511, 2001.

[Wang and Lu, 2014] Dong Wang and Huchuan Lu. Visual tracking via probability continuous outlier model. In CVPR, pages 3478–3485. IEEE, 2014.

[Wu et al., 2013] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[Yilmaz et al., 2006] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Comput. Surv., 38(4), 2006.

[Zeiler and Fergus, 2014] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.

[Zhang et al., 2014a] Jianming Zhang, Shugao Ma, and Stan Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In ECCV, pages 188–203. Springer, 2014.

[Zhang et al., 2014b] Tianzhu Zhang, Si Liu, Narendra Ahuja, Ming-Hsuan Yang, and Bernard Ghanem. Robust visual tracking via consistent low-rank sparse learning. IJCV, pages 1–20, 2014.

[Zhong et al., 2012] Wei Zhong, Huchuan Lu, and Ming-Hsuan Yang. Robust object tracking via sparsity-based collaborative model. In CVPR, pages 1838–1845. IEEE, 2012.