
Heuristic Approach for Robust Visual

Object Tracking

Ahmad Ali

Submitted in partial fulfillment of the requirements for the degree of Ph.D.

August, 2015

Department of Computer and Information Sciences

Pakistan Institute of Engineering and Applied Sciences

P.O. Nilore, Islamabad, Pakistan


Taught man that which he knew not. (Al-Quran)


Thesis Examiners

Student’s Name: Ahmad Ali Department: DCIS

Registration Number: 03-7-1-029-2010 Date of Registration: 11-10-2010

Thesis Title: Heuristic Approach for Visual Object Tracking

Foreign Reviewers (Names and Affiliations)

1. Professor Mihran Tuceryan, School of Science, Indiana University- Purdue University

2. Professor Nie Jian-Wei, Beihang University

3. Professor Chris Chatwin, University of Sussex

Thesis Defense Examiners (Names and Affiliations)

1. Engr. Dr. Shahzad Khalid, Bahria University

2. Dr. Ijaz Mansoor Qurashi, Air University

3. Dr. Sikander Majeed Mirza, PIEAS

Head of the Department (Name): ___________________________ Signatures/Date: ______________


Thesis Submission Approval

This is to certify that the work contained in this thesis entitled Heuristic Approach

for Visual Object Tracking, was carried out by Ahmad Ali, and in my opinion, it is

fully adequate, in scope and quality, for the degree of Ph.D. Furthermore, it is hereby

approved for submission.

Supervisor: _____________________ Name: Dr. Abdul Jalil Date: 19 August, 2015 Place: PIEAS, Islamabad.

Head, Department of Computer and Information Sciences: _____________________ Name: Dr. Javaid Khurshid

Date: 19 August, 2015 Place: PIEAS, Islamabad.


Dedications

I dedicate this thesis to my late father. He died in May 2015 after fighting his disease for four years. He motivated me to pursue my Ph.D. at PIEAS in 2010. May he rest in eternal peace.


Acknowledgements

All praises to ALLAH (S.W.T.), the creator of everything, for blessing us with knowledge and endowing us with the status of the noblest of creations. I am always grateful to almighty ALLAH, the most benevolent and merciful, who blessed me throughout my life despite my limitations, and gave me the ability to undertake such a challenging task and carry it through to completion.

I extend my sincerest thanks and deepest appreciation to my supervisor, Dr. Abdul Jalil, for his generous guidance and moral support during my Ph.D. His valuable suggestions and constructive criticism led me to complete my goal successfully.

A very special note of thanks goes to my parents and my wife, whose

heartfelt prayers, appreciation, and support have always been a valuable asset and a

great source of inspiration for me.

I am also indebted to Dr. Javed Ahmed and Mr. Khalid Akbar for their cooperation and encouragement in attaining my goal. Thanks are due to Mr. Imran Khan and Mr. Naveed Haq, whose encouragement led me to the successful completion of this thesis.

I gratefully acknowledge Mr. Naeem Ahmed, project director of the IT and Telecom Endowment Fund, PIEAS. It was his gratifying attitude that set me free from financial worries throughout my Ph.D. He truly deserves special thanks for his generous support.

Last, but not least, I would like to thank my fellow Ph.D. students (Mr. Adnan Idris, Mr. Mehdi Hassan, Mr. Muhammad Tahir, Mr. Khurram Jawad, Mr. Nasir, Ms. Saima Rathore, Mr. Muhammad Aksam Iftikhar, and Mr. Gibran Javed). These colleagues and friends helped me in times of trouble, praised my achievements, and cheered me up whenever I was downhearted.

Ahmad Ali


Declaration of Originality

I hereby declare that the work contained in this thesis and its intellectual content are the product of my own work. This thesis has not been previously published in any form, nor does it contain any verbatim copies of published resources that could be treated as an infringement of international copyright law. I also declare that I understand the terms ‘copyright’ and ‘plagiarism,’ and that in case of any copyright violation or plagiarism found in this work, I will be held fully responsible for the consequences of any such violation.

__________________ Ahmad Ali

19 August, 2015 PIEAS, Islamabad.


Copyrights Statement

The entire contents of this thesis entitled Heuristic Approach for Visual Object Tracking by Ahmad Ali are the intellectual property of the Pakistan Institute of Engineering & Applied Sciences (PIEAS). No portion of the thesis may be reproduced without obtaining explicit permission from PIEAS.


Table of Contents

Dedications .................................................................................................................... ii

Acknowledgements ...................................................................................................... iii

Declaration of Originality ............................................................................................. iv

Copyrights Statement ..................................................................................................... v

Table of Contents .......................................................................................................... vi

List of Figures ............................................................................................................... ix

List of Tables ............................................................................................................. xiii

List of Algorithms ....................................................................................................... xiv

Abstract ........................................................................................................................ xv

List of Publications .................................................................................................... xvii

1 Introduction ............................................................................................................ 1

1.1 Issues of Visual Object Tracking ....................................................................... 2

1.2 Motivation and Objective .................................................................................. 4

1.3 Contributions of Thesis ...................................................................................... 5

1.4 Thesis Organization ........................................................................................... 6

1.5 Chapter Summary .............................................................................................. 6

2 Literature Survey ................................................................................................... 7

2.1 Related Surveys ................................................................................................. 8

2.2 Contribution to Existing Surveys ....................................................................... 8

2.3 Classical Tracking Approaches.......................................................................... 9

2.3.1 Mean Shift for VOT .................................................................................. 9

2.3.2 Kalman Filter for VOT ........................................................................... 12

2.3.3 Correlation based Template Matching .................................................... 16

2.3.4 Motion Detection for Tracking ............................................................... 18


2.4 Contemporary Tracking Approaches ............................................................... 20

2.4.1 Tracking by Detection............................................................................. 21

2.4.2 Particle Swarm Optimization .................................................................. 23

2.4.3 Sparse Representation ............................................................................. 25

2.4.4 Integration of Context Information ......................................................... 26

2.5 Evaluation Methods for VOT Algorithms and Benchmark Resources ........... 27

2.6 Chapter Summary ............................................................................................ 30

3 Proposed Template Updating Method ................................................................. 32

3.1 Correlation based Template Updating Methods .............................................. 32

3.1.1 Traditional Template Updating Methods ................................................ 33

3.2 Proposed Template Updating Method ............................................................. 34

3.2.1 Case 1 ...................................................................................................... 36

3.2.2 Case 2 ...................................................................................................... 37

3.2.3 Case 3 ...................................................................................................... 37

3.3 Results and Discussion .................................................................................... 37

3.3.1 Qualitative Analysis ................................................................................ 38

3.3.2 Quantitative Analysis .............................................................................. 40

3.4 Chapter Summary ............................................................................................ 42

4 Proposed Visual Tracking Method ...................................................................... 44

4.1 Related Work ................................................................................................... 44

4.2 Proposed Visual Object Tracking Framework ................................................. 46

4.2.1 Correlation and KF based Tracking ........................................................ 46

4.2.2 Adaptive Threshold ................................................................................. 48

4.3 Occlusion Handling with Kalman Filter .......................................................... 49

4.4 Adaptive Fast Mean Shift Algorithm ............................................................... 50

4.5 Combining Correlation, Kalman Filter and Adaptive Kernel Fast Mean Shift

Algorithms ................................................................................................................ 51


4.6 Results and Discussion .................................................................................... 54

4.6.1 Data Set ................................................................................................... 55

4.6.2 Analysis for Proposed Tracking Algorithm ............................................ 56

4.6.3 Adaptive Threshold with Different Parameter Values ............................ 57

4.6.4 Comparison of Proposed Tracking Method with Its Constituents .......... 58

4.6.5 Performance Comparison of Proposed Tracking Methods with Other

Methods............................................................................................................... 59

4.7 Chapter Summary ............................................................................................ 63

5 Stabilized Active Camera Tracking System ........................................................ 70

5.1 Pan-Tilt Control ............................................................................................... 71

5.2 Video Stabilization........................................................................................... 72

5.3 Proposed Pan-tilt Control Algorithm ............................................................... 73

5.4 Proposed Video Stabilization Algorithm ......................................................... 74

5.5 Results and Discussion .................................................................................... 79

5.5.1 Performance of Stabilization Algorithm ................................................. 79

5.5.2 Performance of Active Camera Tracking System .................................. 81

5.5.3 Performance of Stabilized Active Camera Tracking System ................. 86

5.6 Chapter Summary ............................................................................................ 86

6 Conclusion and Future Work ............................................................................... 89

6.1 Summary .......................................................................................................... 89

6.2 Future Work ..................................................................................................... 90

REFERENCES ................................................................................................................ 92


List of Figures

Figure 1.1 Different applications of visual object tracking ........................................... 1

Figure 1.2 Different issues that arise during tracking .................................................... 3

Figure 2.1 Different classical as well as contemporary approaches for visual object tracking ......................................................................................................................... 8

Figure 2.2 (Up) Normal tracking, estimated position by Kalman filter follows the

measured position, (Down) Tracking during occlusion using Kalman filter ............... 14

Figure 2.3 (Source [1]): Adaptive tracking-by-detection process, i.e., tracking the

target and updating the classifier. ................................................................................ 21

Figure 2.4 Positive and negative samples for online AdaBoost [2] ............................. 22

Figure 2.5 Positive and negative bags for MIL classifier [3] ...................................... 22

Figure 2.6 A few tracked frames of Liquor video sequence. The yellow rectangle shows the tracked window; the more closely it fits the target, the better the result. ......... 28

Figure 3.1 Comparison of different updating schemes (i.e., Naive, α, and β methods shown in the first three rows, respectively) with the proposed method (i.e., fourth row) for the Girl video. The video involves two out-of-plane rotations of the target (see Frames 101 and 211). The proposed method updates the template better than any of these methods, and minimizes the template drift. ............................................................ 38

Figure 3.2 Comparison of different updating schemes (i.e., Naive, α, and β methods shown in the first three rows, respectively) with the proposed method (i.e., fourth row) for the Woman video, which contains occlusions, appearance change of the target, clutter, and illumination change in the scene. It is clear that the proposed method works better than the methods in comparison. ............................................................................ 39

Figure 3.3 Comparison of different updating schemes (i.e., Naive, α, and β methods shown in the first three rows, respectively) with the proposed method (i.e., fourth row) for the Faceocc video. The proposed method successfully handles slowly occurring long-term occlusion. ...................................................................................................... 40

Figure 3.4 Center distance error between the ground truth and the value calculated by the naive, alpha, beta, and proposed template updating methods for the Girl video. The template drift is much smaller with the proposed method. ............................................ 41


Figure 3.5 Center distance error between the ground truth and the value calculated by the naive, alpha, beta, and proposed template updating methods for the Woman video. The template drift is much smaller with the proposed method. ..................................... 41

Figure 3.6 Center distance error between the ground truth and the value calculated by the naive, alpha, beta, and proposed template updating methods for the Faceocc video. The template drift is much smaller with the proposed method. ..................................... 42

Figure 4.1 Proposed Tracking Algorithm .................................................................... 52

Figure 4.2 Comparison of results for the simple correlation tracker, the correlation and KF tracker, and the adaptive fast mean shift embedded with the correlation and KF tracker for the ThreePastShop2cor video (from the Caviar dataset). It supports the claim that adding the mean shift approach to the correlation and KF tracker (in the proposed way) improves the results. ...................................................................................... 56

Figure 4.3 Comparison of results for the simple correlation tracker, the correlation and KF tracker, and the adaptive fast mean shift embedded with the correlation and KF tracker for the Liquor video. It supports the claim that adding the mean shift approach to the correlation and KF tracker (in the proposed way) improves the results. .............. 56

Figure 4.4 Comparison of Pascal score of correlation KF tracker with and without

adaptive fast mean shift algorithm ............................................................................... 57

Figure 4.5 Comparison of mean distance error of correlation KF tracker with and

without adaptive fast mean shift algorithm .................................................................. 58

Figure 4.6 Center distance error for Box video sequence ............................................ 62

Figure 4.7 Pascal Score for Box video sequence ......................................................... 63

Figure 4.8 Distance Score for Board video sequence .................................................. 64

Figure 4.9 Pascal Score for Board video sequence...................................................... 65

Figure 4.10 Distance Score for Liquor video sequence ............................................... 66

Figure 4.11 Pascal Score for Liquor video sequence ................................................... 66

Figure 4.12 Sample tracked frames of Box video sequence. The proposed algorithm successfully tracks the target during occlusions, scale changes, 3D motion causing blurriness, and a cluttered background. ........................................................................ 67

Figure 4.13 A few tracked frames of Liquor video sequence. The proposed approach successfully tracks during occlusions, 3D motion causing blurriness, and background clutter. .......................................................................................................................... 67

Figure 4.14 Results for Board video sequence. The proposed algorithm successfully handles the out-of-plane motion of the target in a cluttered background. ...................... 67


Figure 4.15 Frames of Car video sequence. The proposed algorithm successfully

tracks the target in low light conditions. ...................................................................... 68

Figure 4.16 Some frames from David video sequence. The proposed algorithm tracks the target under varying illumination and appearance changes. ................................... 68

Figure 4.17 A few frames of Faceocc2 video sequence. The proposed algorithm

tracks the target with large appearance changes and slowly occurring heavy

occlusions. .................................................................................................................... 68

Figure 4.18 A few frames of Singer video sequence. The proposed algorithm

successfully handles high illumination effects as well as large scale changes. ........... 69

Figure 4.19 Some tracked frames from the sequence ThreePastShop2Cor2 (Caviar

dataset). The main challenges in the video include the existence of similar objects,

and the occlusions which occur while the persons in the sequence cross each other.

The proposed method successfully tracks the target. ................................................... 69

Figure 5.1 Simplified block diagram of the proposed stabilized active camera tracking

system. ......................................................................................................................... 70

Figure 5.2 Relationship between α and cut-off frequency of the low-pass filter ......... 75

Figure 5.3 Magnitude of frequency response of the low-pass filter at α = 0.11 .......... 76

Figure 5.4 Original (left side) versus stabilized (right side) frames of a video recorded

from a vibratory flying helicopter ................................................................................ 77

Figure 5.5 Original versus stabilized x-coordinates of the left truck shown in Figure

5.4................................................................................................................................. 78

Figure 5.6 Original versus stabilized y-coordinates of the left truck shown in Figure

5.4................................................................................................................................. 79

Figure 5.7 Original (left side) versus stabilized (right side) frames of a video recorded

from a vibratory hovering helicopter ........................................................................... 80

Figure 5.8 Original versus stabilized x-coordinates of the building shown in Figure 5.7

...................................................................................................................................... 81

Figure 5.9 Original versus stabilized y-coordinates of the building shown in Figure 5.7

...................................................................................................................................... 82

Figure 5.10 A helicopter is being tracked persistently and precisely with the proposed tracking system even when the user has initialized the template inaccurately, and the size, the appearance, and the velocity of the helicopter are continuously varying. ....... 83

Figure 5.11 Tracking the face of a person during severe illumination variation, noise,

low detail, and occlusion. All the lights in the room were turned off in this experiment


to create a challenging scenario. The dark yellow rectangle in Frame 495 indicates

that the tracker is currently working in its occlusion handling mode. ......................... 84

Figure 5.12 Results of un-stabilized (left column) vs. stabilized active camera tracking

(right column) of a distant airplane .............................................................................. 85

Figure 5.13 Results of un-stabilized (left column) vs. stabilized active camera tracking

(right column) of a pedestrian ...................................................................................... 87


List of Tables

Table 2.1 Several related surveys .................................................................................. 7

Table 2.2 Comparison of different VOT algorithms using Mean Shift (S/M - single target or multiple targets, O - occlusion, IV - high illumination variations, SV - sudden and large change in target velocity, SC - scale change). Symbols √ and ⅹ, respectively, show that the algorithm does or does not handle the issue. ............................................ 11

Table 2.3 Comparison of different VOT approaches exploiting KF (OS- optimum

search, O-occlusion, LM - large target movement, SV- sudden change in velocity).

Symbol √ shows that the tracking algorithm handles the issue and symbol ⅹ means it

does not tackle the issue. .............................................................................................. 16

Table 2.4 Comparison of different correlation metrics. ............................................... 18

Table 2.5 Representative work of tracking-by-detection technique. ........................... 23

Table 2.6 Representative work of using different variants of PSO in VOT ................ 24

Table 2.7 Representative work of exploiting context information for VOT ............... 27

Table 2.8 List of a few online publicly available tracking resources. ......................... 29

Table 3.1 Description of test videos ............................................................................ 37

Table 3.2 Mean center location error for test video sequences using naive, α, β, and

the proposed template updating methods. .................................................................... 42

Table 4.1 Description of dataset .................................................................................. 53

Table 4.2 Pascal score on test video sequences with different values of ψ ................. 54

Table 4.3 Mean distance error on test video sequences with different values of ψ ..... 55

Table 4.4 Comparison of correlation KF tracker with and without adaptive fast mean

shift algorithm .............................................................................................................. 59

Table 4.5 Mean center location error for video sequences of dataset .......................... 60

Table 4.6 Pascal VOC score for video sequences of dataset ....................................... 61

Table 5.1 Maximum steady state error of the tracker ................................................. 74


List of Algorithms

Algorithm 3.1 Proposed template updating method .................................................... 36

Algorithm 4.1 Correlation and Kalman filter tracking ................................................ 47

Algorithm 4.2 Adaptive threshold ............................................................................... 48

Algorithm 4.3 Occlusion handling with Kalman filter ................................................ 49

Algorithm 4.4 Adaptive fast mean shift algorithm ...................................................... 50

Algorithm 4.5 Combining correlation, Kalman filter and adaptive fast mean shift

algorithms .................................................................................................................... 51


Abstract

Visual Object Tracking (VOT) is an important field of computer vision with a number of applications in different domains, including military as well as commercial security and surveillance systems. The contribution of this thesis to the field is fourfold.

Firstly, a comprehensive survey of different classical and contemporary approaches for VOT is presented. It enables a swift understanding of old as well as new trends in this field.

Secondly, a novel method for updating the template (the appearance model of the target) is presented. It adaptively updates the template according to the rate of change of the target’s appearance. Comparison with existing template updating techniques shows the robustness of the proposed method against both template drift and stagnation on the old appearance.
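As a purely illustrative sketch (the function, the rate bounds, and the use of normalized cross-correlation as the change measure are assumptions, not the thesis’s exact formulation), such rate-adaptive template updating can be expressed as a blend whose learning rate grows with the measured appearance change:

```python
import numpy as np

def update_template(template, patch, min_rate=0.02, max_rate=0.25):
    """Blend the current target patch into the template at a rate that
    grows with the measured appearance change (illustrative sketch)."""
    t = template.astype(np.float64)
    p = patch.astype(np.float64)
    # Normalized cross-correlation between the old template and the
    # newly tracked patch measures how much the appearance has changed.
    tz, pz = t - t.mean(), p - p.mean()
    denom = np.sqrt((tz ** 2).sum() * (pz ** 2).sum())
    ncc = (tz * pz).sum() / denom if denom > 0 else 0.0
    # High similarity -> conservative update; large change -> faster update.
    rate = min_rate + (1.0 - max(ncc, 0.0)) * (max_rate - min_rate)
    return (1.0 - rate) * t + rate * p
```

When the patch matches the template, the update rate stays near `min_rate`, so the template resists drift; when the appearance changes quickly, the rate rises toward `max_rate`, so the template does not stagnate on the old appearance.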

Thirdly, a new approach for VOT is proposed which heuristically combines correlation, Kalman filter, and adaptive kernel fast mean shift algorithms. A correlation tracker is, generally, computationally intensive (if the search space or the template is large) and suffers from the template drift problem. Moreover, it fails in the case of a fast maneuvering target, rapid appearance changes, occlusion, and clutter in the scene. These issues are handled by using the proposed template updating method and a Kalman filter (KF) with the correlation tracker. The threshold for template updating is made adaptive by using the current peak correlation value in the proposed tracking framework. The KF predicts the target coordinates for the next frame when the measurement vector is supplied to it by the correlation tracker. Thus, a relatively small search space can be determined in which the probability of finding the target in the next frame is high. This way, the tracker becomes fast and rejects the clutter outside the search space. However, if the tracker provides a wrong measurement vector due to clutter or occlusion inside the search region, the efficacy of the filter deteriorates significantly. In this case, the KF predicted position is far from the correlation measured position. A similar situation arises if a moving target suddenly changes its direction. To handle such scenarios, the Fast Mean Shift (FMS) vector is computed inside the difference image of two consecutive search windows to find the cluster of template size in it, which is considered a target candidate. The FMS kernel is made adaptive to the varying size of the target. The proposed tracker considers the KF predicted position as the true target position if it is close to the FMS generated position; otherwise, the correlation measurement is followed. Comparison with state-of-the-art tracking algorithms on publicly available standard datasets shows that the proposed algorithm outperforms the other algorithms in most cases.

Fourthly, a stabilized active camera tracking system is presented. It comprises a camera mounted on a Pan-Tilt Unit (PTU) which is placed on a moving platform. Vibrations of the moving platform produce jitters in the video from the camera, which may strain the eyes of the viewer. The outcome of the proposed tracking algorithm is employed to digitally stabilize the video without any significant computational overhead. Experimental results show the efficacy of the proposed algorithm.

Index terms – visual object tracking, template updating, video stabilization, Kalman filter


List of Publications

• Ahmad Ali, Abdul Jalil, JianWei Niu, Xioke Zhao, Javed Ahmed, Muhammad Aksam Iftikhar, Saima Rathore, “Visual Object Tracking – Classical and Contemporary Approaches”, accepted in Frontiers of Computer Science, Springer Verlag, 2015.

• Ahmad Ali, Abdul Jalil, Javed Ahmed, Muhammad Aksam Iftikhar, Mutawarra Hussain, “Correlation, Kalman Filter and Adaptive Fast Mean Shift based Heuristic Approach for Robust Visual Tracking”, Journal of Signal, Image, and Video Processing, Springer Verlag, pp. 1-19, Jan. 2014, doi: 10.1007/s11760-014-0612-0.

• Javed Ahmed, Ahmad Ali, Asifullah Khan, “Stabilized Active Camera Tracking System”, Journal of Real-Time Image Processing, Springer Verlag, pp. 1-20, May 2012, doi: 10.1007/s11554-012-0251-z.

• Irum Anayat, Rooh-ul-Amin, Ahmad Ali, “Moving Object Tracking in Video Sequences: Moving Object Tracking in Video Sequences through Template Matching, Fast Mean Shift and Kalman Filter”, VDM Verlag Dr. Muller, 2011, ISBN: 978-3639377552.

• Muhammad Imran Khan, Javed Ahmed, Ahmad Ali, Asif Masood, “Robust Edge-Enhanced Fragment Based Normalized Correlation Tracking in Cluttered and Occluded Imagery”, International Journal on Advanced Science and Technology, vol. 12, pp. 25-34, 2009, doi:10.1.1.359.7828.

• Ahmad Ali, Hameed Kausar, Muhammad Imran Khan, “Automatic Visual Tracking and Firing System for Anti-Aircraft Machine Gun”, in Proc. 6th International Bhurban Conference on Applied Sciences & Technology (IBCAST), 2009.


• Ahmad Ali, Sikander Majeed Mirza, “Object Tracking using Correlation, Kalman Filtering and Fast Mean Shift Algorithms”, in Proc. International Conference on Emerging Technologies (ICET), 2006.

• Ahmad Ali, Abdul Jalil, Javed Ahmed, Saima Rathore, Muhammad Aksam Iftikhar, “A New Template Updating Method for Correlation Tracking”, to be submitted soon.

• Muhammad Aksam Iftikhar, Abdul Jalil, Saima Rathore, Ahmad Ali, Mutawarra Hussain, “An Extended Nonlocal Means Algorithm: Application to Brain MRI”, accepted in International Journal of Imaging Systems and Technology, Wiley, 2014.

• Ahmad Ali, Ilyas Butt, Asifullah Khan, “Browse-Back Post Event Analyzer”, in Proc. of IEEE Conference on Frontiers of Information Technology, 2011.

• Muhammad Aksam Iftikhar, Abdul Jalil, Saima Rathore, Ahmad Ali, Mutawarra Hussain, “Brain MRI Denoising and Segmentation based on Improved Adaptive Nonlocal Means”, International Journal of Imaging Systems and Technology, Wiley, pp. 234-248, 2013.

• Saima Javed, Mutawarra Hussain, Ahmad Ali, Asifullah Khan, “A Recent Survey on Colon Cancer Detection Techniques”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2013.

• Saima Rathore, Aksam Iftikhar, Ahmad Ali, Mutawarra Hussain, Abdul Jalil, “Capture Largest Included Circles: An Approach for Counting Red Blood Cells”, in Emerging Trends and Applications in Information Communication Technologies, Springer, pp. 373-384, 2012, ISBN: 978-3-642-28961-3.

• Saima Rathore, Madeeha Naiyar, Ahmad Ali, “Comparative study of entity and group mobility models in MANETs based on underlying reactive, proactive and hybrid routing schemes”, in Proc. 15th IEEE International Multi Topic Conference (INMIC), December 13-15, 2012.


1 Introduction

Visual Object Tracking (VOT) is a well-known research area in computer vision. Its main objective is to find the locus of points that the target of interest follows in image coordinates. This information may be of significant importance for further analysis, e.g., to calculate the area, perimeter, center of mass, and motion vector of the target. Thus, target tracking may play an important role in high level image analysis tasks, e.g., object recognition [72, 73], activity analysis [5, 74], and intelligent scene understanding [75]. With the easy accessibility of low cost, high performance computing power and ubiquitously available digital cameras, the usability spectrum of VOT has become wider, and it has found applications in several real world systems. A few of its applications, shown in Figure 1.1, include:

Human Machine Interaction (HMI): VOT plays an important role in improving community life by providing easy-to-use interaction with machines, e.g., sixth-sense (a wearable gesture interface) [76], perceptual user interfaces [77], eye gaze tracking for disabled people [78], etc.

Figure 1.1 Different applications of visual object tracking: visual surveillance and security systems, activity recognition, video games, vehicle tracking, traffic monitoring, human machine interaction, industrial robotics, and medical diagnosis systems


Visual Surveillance and Security Systems (VSSS): These systems are ubiquitous in recent times, and VOT is an important part of intelligent visual surveillance, e.g., 3rd Generation Surveillance Systems (3GSS) [66], Siemens Sistore CX EDS [79], surveillance of places and buildings of public and defense interest for intruder detection [80], monitoring human activities [81-86], etc.

Traffic Monitoring: VOT provides solutions for monitoring and management of road traffic, e.g., detection of traffic accidents [87, 88], counting of pedestrians [89], etc.

Industrial Robotics: VOT is applied in the control systems of industrial and humanoid robots, e.g., using a vision sensor with a tracking algorithm in the feedback loop [90], the ASIMO humanoid robot [91], visual control for Unmanned Aerial Vehicles (UAVs) [92], etc.

Vehicle Tracking: VOT is used for automobile tracking, e.g., tracking a vehicle by UAV [93], tracking vehicles on the road to assist the driver [94], [18], the autopilot of a UGV [95], etc.

Video Games: VOT is used in video games to provide better user control, e.g., tracking user movements [96], face tracking for playing games [97], etc.

Medical Diagnosis Systems: VOT has shown its importance in the medical field for the diagnosis of different diseases, e.g., tracking of the ventricular wall [98], reconstruction of the vocal tract shape [99], [100], etc.

Activity Recognition: VOT is an important component of activity recognition systems for indoor and outdoor monitoring, e.g., learning activity patterns [101], human activity recognition [102], etc.

1.1 Issues of Visual Object Tracking

Researchers have invested immense effort in the field of VOT over the last four decades [56], [103]. Nonetheless, it is still a nontrivial task due to various issues, as depicted in Figure 1.2. The issues are described as follows:

Occlusion: This is the state in which the target is hidden (partially or fully) by another object. Occlusion detection and handling is an important issue, but there is no universal technique to tackle it. Therefore, strategies are adopted according to the nature of the target and the tracking environment.

Appearance change: Most targets, especially non-rigid objects, change their appearance during motion. Therefore, it is mandatory for the target model to adapt to these changes over a long term tracking session. Small inaccuracies are introduced into the target model during updating; they accumulate as time passes and ultimately result in unstable tracking as the template slides off the target onto the background. This issue is called the template drift problem. On the contrary, if the model is fixed, i.e., not updated or only slowly updated, the template cannot incorporate changes in the target appearance, and the tracker loses the target due to the stagnation to the old appearance problem. Thus, a trade-off between drift and stagnation is required; this is called the stability vs. plasticity dilemma [53].

Figure 1.2 Different issues that arise during tracking: (a) occlusion, (b) appearance change, (c) cluttered background, (d) changing size in image, (e) illumination variations, (f) noise in image, (g) similar objects, (h) complex object motion

Cluttered background: When the background of the target contains many other objects, it is called a cluttered environment. If the background is known in advance (e.g., indoor tracking), it is easy to handle a cluttered environment, but for an unknown background or outdoor tracking, the problem becomes more severe.

Changing target size in image: When the target moves towards or away from the camera, its size in the image increases or decreases, respectively. Therefore, the size of the target appearance model needs to be changed accordingly for robust tracking.

Illumination variations: Many features of the target which are prominent in high luminance become obscure in low luminance, and vice versa. This deteriorates tracking performance. Therefore, illumination change needs to be handled for robust visual tracking.

Noise in image: The image of the target scene may be noisy (e.g., due to electronic circuit noise). Therefore, some preprocessing is required to remove the noise from the image for robust tracking.

Similar objects: When there are objects similar to the target nearby, the appearance model is likely to produce a high matching score on those objects, and discriminating the target from the rest of the objects becomes tough.

Complex object motion: When the target motion is complex, such as out-of-plane movement or abrupt variations in speed and direction (e.g., the motion of a fighter plane or of people skating), tracking becomes difficult due to the inexact approximation of the underlying motion model.

1.2 Motivation and Objective

Technological advancement in the field of digital video cameras and continuously increasing computational power have attracted researchers and developers to build various visual applications. The usefulness as well as usability of VOT, as described above, is growing steadily. Nonetheless, it is a challenging task in general due to missing prior information about the target and its environment. The main objective of this research is to propose an algorithm which can work robustly in general when faced with occlusion, clutter, changing target appearance, illumination variations, etc.

1.3 Contributions of Thesis

The contributions of the thesis are manifold, including the following:

A comprehensive summary of relevant literature, which introduces a new taxonomy of VOT algorithms into classical and contemporary approaches, along with a discussion of different tracking algorithms. This way, the reader may quickly understand old as well as recent trends in VOT algorithms.

A novel template updating method, which updates the template according to the rate of appearance change of the target. It tackles both the drift and the stagnation to the old appearance problems.

A new tracking framework, which heuristically integrates three elementary tracking algorithms, namely the correlation tracker, Kalman filter, and mean shift algorithms, in a selective and adaptive manner. The proposed tracking framework includes: (1) an adaptive method for updating the template size, appearance, and search area, (2) a heuristic technique for switching back and forth between the correlation measured output and the Kalman filter predicted output, based on the closeness of the mean shift tracker output to either the measured or the predicted target position, respectively, and (3) heuristic techniques for updating some of the thresholds associated with different decision steps throughout the algorithm.

A stabilized active camera tracking system, which uses the tracking algorithm on the video captured from a camera mounted on a pan-tilt unit, along with its motion control algorithm for active tracking. The active camera tracking system produces jitters in the video if it is fixed on a moving platform, e.g., an Unmanned Aerial Vehicle (UAV), Unmanned Ground Vehicle (UGV), helicopter, etc.; therefore, video stabilization is required in order to provide ease to the user. A new stabilization algorithm is presented in the thesis, which uses the tracking algorithm to filter out the jitters in the video without adding any significant computation overhead.


1.4 Thesis Organization

The rest of the thesis is organized as follows.

Chapter 2 presents the literature survey for VOT. It classifies tracking algorithms into classical and contemporary approaches, and discusses the different techniques in each approach. Moreover, tracking resources available online are presented. Thus, the reader quickly gets an idea of conventional as well as modern trends in this field.

Chapter 3 proposes a new template updating method, which adapts to variations in the appearance of the target according to their rate of change. Experimental results show that the proposed updating method significantly avoids template drift as compared to the other methods.

Chapter 4 describes the proposed tracking algorithm. It heuristically combines the correlation, Kalman filter, and adaptive fast mean shift algorithms such that these elementary algorithms complement each other for robust tracking. Experimental results show the efficacy of the algorithm.

Chapter 5 presents the stabilized active camera tracking system, which consists of a Pan-Tilt Unit (PTU) for active tracking. The video becomes shaky if the PTU is fixed on a moving platform. The proposed video stabilization method uses the tracking algorithm to digitally stabilize the video without any significant computational burden.

Chapter 6 concludes the thesis and sums up the techniques presented for template updating and robust visual tracking. Moreover, it discusses future directions for VOT.

1.5 Chapter Summary

This chapter discusses applications of VOT in different fields such as human-computer interaction, industrial robotics, traffic monitoring, video games, vehicle tracking, medical diagnosis systems, and security and surveillance systems. Although a lot of work has been done in this field, there exists no universal solution for VOT due to the absence of any prior information about the target and its background. Moreover, tracking algorithms face many issues such as occlusion, clutter, similar objects, noise, variations in lighting conditions, complex object motion, and changing target appearance and size.


2 Literature Survey

In recent years, VOT has made significant progress due to the availability of low cost, high quality video cameras as well as fast computational resources. Many modern techniques have been proposed to handle the challenges faced by VOT. This chapter introduces the reader to (1) various classical as well as contemporary approaches for object tracking, (2) evaluation methodologies for VOT, and (3) online resources, i.e., annotated datasets and source code available for various tracking techniques.

Table 2.1 Several related surveys

Related Surveys          | Year | Topic
Chau et al. [5]          | 2013 | Tracking any object
Yilmaz et al. [12]       | 2006 | Tracking any object
Joshi et al. [20]        | 2012 | Tracking any object
Yang et al. [30]         | 2011 | Tracking any object
Cannons [35]             | 2008 | Tracking any object
Geronimo et al. [41]     | 2010 | Pedestrian tracking
Ogale et al. [47]        | 2006 | Pedestrian tracking
Trucco et al. [51]       | 2006 | Surveillance and motion analysis
Aggarwal et al. [55]     | 1997 | Surveillance and motion analysis
Zhan et al. [58]         | 2008 | Surveillance and motion analysis
Kang et al. [60]         | 2007 | Surveillance and motion analysis
Arikan et al. [63]       | 2006 | Surveillance and motion analysis
Kim et al. [66]          | 2010 | Surveillance and motion analysis
Moeslund et al. [68]     | 2006 | Surveillance and motion analysis
Arulampalam et al. [69]  | 2002 | Bayesian tracking
Jalal et al. [70]        | 2012 | Wavelet for object tracking
Li et al. [71]           | 2013 | Appearance models


2.1 Related Surveys

Several surveillance and tracking related surveys can be found in the literature, as shown in Table 2.1. Most of these surveys are old (i.e., from the last decade), e.g., [12], [35], [47], [51], [68, 69], [55], [60], [63], [58], etc.; some cover only a specific field or technique for tracking (e.g., pedestrian tracking [41], Bayesian methods [69], tracking under sea water [51], wavelets for tracking [70], etc.); a few discuss tracking within a different principal category (e.g., crowd analysis [58], human motion analysis [68], intelligent visual surveillance [60], appearance models [71], etc.); and the recent surveys discuss only modern trends in VOT, e.g., [30], or recent algorithms using variants of classical techniques, e.g., [5], [70].

2.2 Contribution to Existing Surveys

The present survey discusses: (1) classical and contemporary approaches for visual object tracking, as shown in Figure 2.1, (2) a comparison of different tracking algorithms, and (3) online available resources for different tracking algorithms, such as source code, annotated datasets, etc. The survey will help readers to briskly understand the old as well as the current trends and approaches in visual object tracking.

Figure 2.1 Different classical as well as contemporary approaches for visual object tracking. Classical approaches: mean shift, Kalman filter, correlation based template matching, and motion detection for tracking. Contemporary approaches: tracking by detection, particle swarm optimization, sparse representation, and integration of context information


2.3 Classical Tracking Approaches

In this section, the following widely known classical approaches for visual object tracking are discussed: (1) mean shift, (2) Kalman filtering, (3) correlation based template matching, and (4) motion detection based tracking algorithms. The main aim of this section is to highlight different tracking algorithms using the aforementioned approaches.

2.3.1 Mean Shift for VOT

Mean shift is a non-parametric statistical iterative method, originally developed by Fukunaga and Hostetler [104]. It is used to find the mode of a distribution given its discrete sample points; it is therefore useful in data clustering. Cheng [105] introduced it to the image processing community. It is a very simple and straightforward algorithm. It randomly picks image pixels as representatives of cluster centers. A hypothesized multidimensional ellipsoid is centered on each cluster center and moved to the mean of the data lying inside the ellipsoid. The same process is repeated for all the clusters. The mean is iteratively calculated and the cluster centers are moved accordingly until there is no change in the mean value. Adjacent and similar regions (similarity depends upon the application type and user defined criteria) are merged during the iterations, and the number of final clusters may be much smaller than the initial number of clusters. Eq. (2.1) describes the calculation of the mean shift vector as given by [106]:

\[
\mathbf{m}(\mathbf{x}) = \frac{\displaystyle\sum_{i=1}^{n} \mathbf{x}_i \, g\!\left( \left\| \frac{\mathbf{x} - \mathbf{x}_i}{h} \right\|^2 \right)}{\displaystyle\sum_{i=1}^{n} g\!\left( \left\| \frac{\mathbf{x} - \mathbf{x}_i}{h} \right\|^2 \right)} - \mathbf{x} \qquad (2.1)
\]

where g(·) is the kernel function, x is the center point, x_i are the data points, h is the kernel bandwidth, and n is the number of data points.
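The mode-seeking iteration behind Eq. (2.1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in this thesis; a Gaussian profile is assumed for g(·):

```python
import numpy as np

def mean_shift_mode(points, start, h=1.0, tol=1e-5, max_iter=100):
    """Seek the nearest density mode from `start` by repeatedly applying
    the mean shift vector m(x) of Eq. (2.1)."""
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # g(||(x - x_i)/h||^2): a Gaussian profile plays the role of g(.)
        w = np.exp(-0.5 * np.sum(((x - points) / h) ** 2, axis=1))
        m = (w[:, None] * points).sum(axis=0) / w.sum() - x  # m(x)
        x = x + m                      # move the window to the weighted mean
        if np.linalg.norm(m) < tol:    # converged: m(x) vanishes at a mode
            break
    return x

# Two well-separated 2-D clusters; starting near one converges to its mode.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                  rng.normal(5.0, 0.3, (50, 2))])
mode = mean_shift_mode(data, start=[4.0, 4.0], h=1.0)
```

In tracking, the data points are pixels of the candidate region weighted by their similarity to the target model, and the converged mode gives the new target position.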

Mean shift based schemes suffer from a few drawbacks. They require manual adjustment of system parameters such as the smallest and largest possible window size, spatial kernel bandwidths, etc. Starting with its application in image segmentation [105, 107], mean shift gained popularity in the field of VOT following the research work of Comaniciu et al. [10]. In this work, mean shift was used for real-time tracking of non-rigid objects viewed from a moving camera. A probability density (histogram) was used to model the target, and color was used as the feature for tracking.

The mean shift algorithm finds the most probable position of the target in each upcoming frame. Comparison of probable target candidates with the original target model was made using a metric based on the Bhattacharyya coefficient. This work was extended as kernel-based object tracking in [17]. The proposed scheme proved to be computationally fast, and robust against clutter, occlusions, camera orientation, and scale changes in several scenarios, but it was not successful against illumination changes and unpredicted object motion. Moreover, spatial information about the target is lost due to the use of a color histogram as the target representative, and the Bhattacharyya coefficient is not a strong discriminative measure [108]. Yang et al. [24] introduced a

new similarity measure using an RBF kernel, which is the expectation of the spatially smoothed density estimates over the model image; it improved the robustness and frame rate of tracking. Beleznai et al. [25-29] exploited the mode seeking capability of mean shift and applied it to the difference image for detecting and tracking humans in a video. They used a fast version of mean shift for change detection in video. A model based validation scheme was used to confirm the detected changes as humans. Fast mean shift finds clusters in the difference image, and the updated cluster parameters (e.g., cluster centers) are used for tracking purposes. The idea of mean shift for VOT was extended by Zivkovic et al. [34], who used it not only for finding the local mode of the density function, but also for estimating the local mode shape. The algorithm shows robustness to scale changes and adaptation of shape, but it was fragile in the case of clutter in the background, the presence of multiple targets, and rapidly changing appearance or motion of the target. Zhou et al. [39] introduced a new cost function for tracking of non-rigid objects and improved the performance of Zivkovic et al. [34] in complex scenes. Their algorithm optimally adapts an ellipse for marking the target of interest (TOI). The new cost function contains a Lagrangian regularization factor which decreases the difference between the estimated and expected probability distributions. The algorithm shows better results, but it requires more prior information and is computationally slower than [34]. Ning et al. [46] used the mean shift approach with a joint color-texture feature to track the target in complex environments. Shan et al. [52] combined mean shift with particle filtering for its sampling efficiency. Their work generated good results for rapid motion with fewer particles than the particle filter alone, but it did not perform well under occlusion and cluttered background. Wang et al. [57] applied mean shift to infrared imagery to track humans. They used motion guided gray and edge cues to improve the mean shift results. Their algorithm works only for a fixed camera. The following issues arise when the mean shift approach is used for VOT with a histogram as the target representative:

• The mean shift approach converges only locally, due to the local basin of attraction

Table 2.2 Comparison of different VOT algorithms using mean shift (S/M – single or multiple targets, O – occlusion, IV – high illumination variations, SV – sudden and large change in target velocity, SC – scale change). Symbols √ and ⅹ, respectively, show that the algorithm does or does not handle the issue.

Representative work | Target representation | Similarity measure | S/M | O | IV | SV | SC
D. Comaniciu et al. [10], [17] | Color histogram | Bhattacharyya coefficient | S | √ | ⅹ | ⅹ | √
C. Yang et al. [24] | Joint spatial-feature space | Expectation of density estimates | S | √ | ⅹ | √ | √
C. Beleznai et al. [25-29] | Difference image | No similarity measure | M | √ | √ | √ | ⅹ
Zivkovic et al. [34] | Color histogram | Expectation-Maximization-like algorithm | S | √ | ⅹ | ⅹ | √
Zhou et al. [39] | Color histogram | Expectation-Maximization-like algorithm with ellipse outlining the target | S | √ | ⅹ | ⅹ | √
Ning et al. [46] | Joint color-texture histogram | Bhattacharyya coefficient | S | √ | ⅹ | ⅹ | √
Shan et al. [52] | Motion color | Distance function | S | √ | √ | √ | ⅹ
X. Wang et al. [57] | Motion and gray edge cues | Bhattacharyya coefficient | S | √ | √ | √ | ⅹ
A. Adam et al. [4] | Fragment based histogram representation | Earth Mover's Distance | S | √ | ⅹ | ⅹ | √
J. Jiakar et al. [61] | Fragment based representation | Bhattacharyya coefficient | S | √ | √ | ⅹ | √
M.I. Khan et al. [65] | Fragment based edge representation | Normalized correlation | S | √ | √ | ⅹ | √


• Spatial information is lost due to the use of the histogram

• Due to the global nature of the template model, it cannot handle occlusion (even partial occlusion) with good accuracy.
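Since several of the histogram-based trackers above compare the target model and candidate with the Bhattacharyya coefficient, a minimal sketch of that measure may be helpful. This is a textbook illustration, not code from any of the cited trackers:

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_u sqrt(p_u * q_u) over the bins of two histograms.

    Both histograms are normalized first; BC ranges from 0 (no overlap)
    to 1 (identical distributions)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(p * q).sum())

# A 3-bin "color histogram" of the template vs. two candidates.
template  = [4, 2, 2]
same_look = [8, 4, 4]    # identical shape after normalization -> BC = 1.0
different = [0, 1, 7]
bc_same = bhattacharyya_coefficient(template, same_look)
bc_diff = bhattacharyya_coefficient(template, different)
```

Comaniciu et al. [10], [17] turn this coefficient into a distance, d = sqrt(1 − BC), which their mean shift iterations then minimize.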

The first two issues were handled using different variants of the mean shift algorithm, such as in [24, 109], and the third one was tackled using the fragment based approach. Adam et al. [4] proposed fragment based VOT using mean shift to handle the last two of the aforementioned issues. They selected fragments randomly, instead of using model based patches (e.g., head, limb, torso). These spatially non-overlapping patches help in preserving the spatial information. Multiple histograms were used to represent each sub-region or patch of the template. The template position in the upcoming image frame is calculated using a vote map formed from each patch's individual vote. The integral histogram technique was used to make the algorithm efficient. Their algorithm shows robustness to partial occlusion, but it lacks a principled method for selecting the patches. Jiakar et al. [61] combined the fragment based approach with mean shift. The user was taken into the loop to select the fragments manually. The patches may be overlapping or non-overlapping. A Bhattacharyya coefficient based metric was used as the similarity measure. The algorithm showed impressive results in the case of partial occlusion; it also handled illumination, appearance, and scale changes, and cluttered background. Khan et al. [65] used normalized correlation for patch based template matching to track the target in the presence of occluded and cluttered imagery. The template was partitioned into nine non-overlapping fragments. Table 2.2 summarizes the comparison of different algorithms using the mean shift approach. Its first column contains the representative work, and the remaining columns describe the target representation, the similarity measure, and the issues handled by each method.
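The fragment idea is easy to prototype. The sketch below is a hypothetical illustration in the spirit of Khan et al. [65], not their actual code: it splits the template into nine non-overlapping patches, scores each patch with zero-mean normalized correlation, and combines the votes with a median so that a few occluded patches cannot dominate the score:

```python
import numpy as np

def split_patches(img, k=3):
    """Partition an image into k x k non-overlapping fragments."""
    rows = np.array_split(img, k, axis=0)
    return [p for row in rows for p in np.array_split(row, k, axis=1)]

def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-size patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def fragment_score(template, candidate, k=3):
    """Median of per-fragment NCC scores: a robust vote combination,
    since a minority of occluded fragments cannot pull the median down."""
    pairs = zip(split_patches(template, k), split_patches(candidate, k))
    return float(np.median([ncc(t, c) for t, c in pairs]))

rng = np.random.default_rng(1)
template = rng.random((30, 30))
occluded = template.copy()
occluded[:10, :10] = 0.0                 # occlude exactly one of the 9 fragments
whole_score = ncc(template, occluded)    # global matching score drops
frag_score = fragment_score(template, occluded)  # median vote stays near 1.0
```

Adam et al. [4] use the same partitioning idea but compare per-fragment histograms with the Earth Mover's Distance and aggregate spatial votes over candidate positions.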

2.3.2 Kalman Filter for VOT

The Kalman Filter (KF) is a statistical parametric recursive algorithm specially designed for discrete time systems. It is based on a motion model of a linear dynamic system and therefore requires its state space representation, as shown in Eq. (2.2) and Eq. (2.3) [110]:

\( \mathbf{X}_{n+1} = \mathbf{\Phi} \mathbf{X}_n + \mathbf{U}_n \)  (2.2)


\( \mathbf{Y}_n = \mathbf{M} \mathbf{X}_n + \mathbf{V}_n \)  (2.3)

where Xn symbolizes the state vector, Φ represents the state transition matrix, Un denotes the system noise vector, Vn is the observation noise vector, Yn is the measurement vector, and M is the observation matrix. The KF estimates the states of the dynamic system in the presence of (1) noisy measurements (Gaussian noise), and (2) uncertainty in the model of the dynamic system. It works in a prediction-correction cycle: based on the observed (measured) states, the KF corrects its predicted states and updates its gain matrix for better future predictions, as described by Eq. (2.4) to Eq. (2.9) [111-113].

X*_{n|n} = X*_{n|n-1} + K_n (Y_n - M X*_{n|n-1})    (2.4)

where X*_{n|n} represents the posterior (corrected) estimate, X*_{n|n-1} the prior (predicted) estimate, and K_n the Kalman gain matrix, defined as:

K_n = S*_{n|n-1} M^T [ R_n + M S*_{n|n-1} M^T ]^{-1}    (2.5)

where R_n is the observation noise covariance, calculated by Eq. (2.6), and S*_{n|n-1} represents the predictor error covariance, defined by Eq. (2.7):

R_n = COV(V_n) = E[ V_n V_n^T ]    (2.6)

where E[.] is the expected value.

S*_{n|n-1} = COV(X*_{n|n-1}) = Φ S*_{n-1|n-1} Φ^T + Q_n    (2.7)

S*_{n-1|n-1} = COV(X*_{n-1|n-1}) = (I - K_{n-1} M) S*_{n-1|n-2}    (2.8)

where Q_n is the system noise covariance matrix, calculated by Eq. (2.9):

Q_n = COV(U_n) = E[ U_n U_n^T ]    (2.9)
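To make the prediction-correction cycle of Eq. (2.2) to Eq. (2.9) concrete, the following is a minimal NumPy sketch of a constant-velocity KF for 2-D position tracking. The state layout, the noise values, and the `z=None` convention for the occlusion mode are illustrative assumptions, not part of any cited work.

```python
import numpy as np

dt = 1.0
# Constant-velocity model: state X = [x, y, vx, vy]^T
Phi = np.array([[1, 0, dt, 0],
                [0, 1, 0, dt],
                [0, 0, 1, 0],
                [0, 0, 0, 1.0]])          # state transition matrix (Eq. 2.2)
M = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0.0]])            # observation matrix (Eq. 2.3)
Q = 1e-3 * np.eye(4)                      # system noise covariance (Eq. 2.9)
R = 1.0 * np.eye(2)                       # observation noise covariance (Eq. 2.6)

def kf_step(x, S, z=None):
    """One prediction-correction cycle. z=None models the occlusion
    mode, where the filter keeps its own prediction (no correction)."""
    # Prediction
    x = Phi @ x                           # predicted state
    S = Phi @ S @ Phi.T + Q               # predictor error covariance (Eq. 2.7)
    if z is not None:
        # Correction (normal tracking mode)
        K = S @ M.T @ np.linalg.inv(M @ S @ M.T + R)  # Kalman gain (Eq. 2.5)
        x = x + K @ (z - M @ x)           # posterior estimate (Eq. 2.4)
        S = (np.eye(4) - K @ M) @ S       # corrected covariance (Eq. 2.8)
    return x, S

# Track a target moving at (1, 2) pixels per frame
x, S = np.zeros(4), np.eye(4)
for t in range(1, 51):
    x, S = kf_step(x, S, z=np.array([t, 2.0 * t]))
```

During a short-term occlusion a tracker would simply call `kf_step(x, S)` with no measurement, so the filter coasts on its own prediction, as described above.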


Derivation of KF equations can be found in [111]. In VOT, KF is widely used

in conjunction with other algorithms [9, 17, 21, 32, 36, 44, 114-119]. KF normally

acts in two modes during tracking, which are: (1) normal tracking mode, in which KF

predicts the target coordinates in the image plane for the next frame on the basis of its

position in the current frame. It helps to find the position of the search window [44],

and (2) occlusion mode, in which KF ignores the measured value and uses its

predicted value for next state prediction. Thus, it is used to handle short-term

occlusion. Figure 2.2 (Up) shows the normal tracking mode, in which the KF-predicted and the actual measured target positions overlap each other [120]. Figure 2.2 (Down) illustrates the occlusion mode, in which KF does not rely on the measured target position and instead uses its own prediction to yield the target position in the next frame.

Figure 2.2 (Up) Normal tracking, estimated position by Kalman filter follows the measured position, (Down) Tracking during occlusion using Kalman filter

Ahmed et al. [44] combined KF with normalized correlation to handle short-term occlusion as well as to find the most likely position and size of the search

window for the next frame. Jang et al. [15] used KF for target motion prediction in

order to reduce the search space for matching the target. Comaniciu [9] combined KF

with the mean shift tracker but it could not handle the large movement of the target.

Ali et al. [36] used correlation and fast mean shift algorithms with KF to handle the

complex maneuvered motion of an airborne object. Li et al. [21] used KF with mean

shift and fast motion estimation algorithm to handle the large and sudden movement

of the target. Li et al. [32] used Bhattacharya coefficient for adjusting the KF

estimation parameters adaptively. Their results show the robustness against partial or

full occlusions, fast target motion, and sudden changes in the target velocity. Ridder et al. [49] used KF for a discriminative tracking approach: they modeled each pixel with a KF in order to handle variations in illumination. In this way, KF is used for adaptive background estimation and foreground detection. Peterfreund [121] used KF with active snakes for robust tracking of the position and velocity of non-rigid as well as rigid objects. He used the image gradient along the contour and its optical flow as the system measurement.

KF assumes a linear dynamic model of target motion and Gaussian noise in measurement, which is not always true in the real world. Therefore, different variants such as the Extended KF (EKF) and the Unscented KF (UKF) have been introduced [122]. EKF applies a first-order Taylor series to approximate a nonlinear system, whereas UKF avoids such an approximation: it uses the unscented transformation to generate a set of sigma points, which are propagated through the dynamic state and observation models to obtain the final result. UKF results are better than those of EKF, but it still assumes a Gaussian posterior distribution and therefore cannot work in the case of multi-modal distributions. The Particle Filter (PF) [69] is used to cater for these issues. PF is a non-parametric, Monte Carlo simulation based method [123]; it was first used in a tracking application by Isard and Blake [59] under the name Condensation. PF represents the state of the target by a set of weighted particles. The weight of each particle is assigned according to its contribution in finding the target's location, and the position of each particle is updated according to the motion model and the measurement data. PF suffers from the problem of sample impoverishment, in which samples contribute no


useful information in estimating the target position. Further detail can be found in [69,

124]. Table 2.3 briefly describes the different VOT approaches exploiting KF,

representative work, and issues handled by each technique.

2.3.3 Correlation based Template Matching

Template matching or correlation tracking is a classical method in the field of VOT, with a history dating back to 1973 [56, 103, 125]. The process of tracking is started by

selecting the target in the first frame manually or by some automatic target detection

system. The representation of the target is called a template, which is used to locate

the target by correlating it with the video frame in each iteration. The location with

the highest correlation score is considered as the new target position. Different

correlation metrics, e.g., standard correlation (SC) [126] (Eq. (2.10)), phase

correlation (PC) [127] (Eq. (2.11)), normalized correlation (NC) [126] (Eq. (2.12)),

normalized cross correlation (NCC) [128, 129] (Eq. (2.13)), are usually used as

similarity measures in tracking applications. Details of these metrics can be found in [44].

Table 2.3 Comparison of different VOT approaches exploiting KF (OS - optimum search, O - occlusion, LM - large target movement, SV - sudden change in velocity). Symbol √ shows that the tracking algorithm handles the issue; symbol ⅹ means it does not tackle the issue.

| VOT approaches exploiting KF | Representative work | OS | O | LM | SV |
| Mean Shift and KF | D. Comaniciu [9] | √ | √ | ⅹ | ⅹ |
|  | Jang et al. [15] | √ | ⅹ | ⅹ | ⅹ |
|  | Zhulin Li et al. [21] | ⅹ | ⅹ | √ | √ |
|  | Xiaohe Li et al. [32] | ⅹ | √ | √ | √ |
| Correlation and KF | A. Ali et al. [36] | ⅹ | √ | √ | √ |
|  | J. Ahmed et al. [44] | √ | √ | ⅹ | ⅹ |
| Background / Foreground Detection and KF | C. Ridder [49] | ⅹ | ⅹ | √ | √ |


c(m, n) = Σ_{i=0}^{K-1} Σ_{j=0}^{L-1} f(m+i, n+j) t(i, j)    (2.10)

c = real( idft( (F · T*) / |F · T*| ) )    (2.11)

c(m, n) = [ Σ_{i=0}^{K-1} Σ_{j=0}^{L-1} f(m+i, n+j) t(i, j) ] / sqrt( Σ_{i=0}^{K-1} Σ_{j=0}^{L-1} f(m+i, n+j)^2 · Σ_{i=0}^{K-1} Σ_{j=0}^{L-1} t(i, j)^2 )    (2.12)

c(m, n) = [ Σ_{i=0}^{K-1} Σ_{j=0}^{L-1} (f(m+i, n+j) - μ_f)(t(i, j) - μ_t) ] / sqrt( Σ_{i=0}^{K-1} Σ_{j=0}^{L-1} (f(m+i, n+j) - μ_f)^2 · Σ_{i=0}^{K-1} Σ_{j=0}^{L-1} (t(i, j) - μ_t)^2 )    (2.13)

where f is the image, t is the template, F and T are their Fourier transforms, T* is the conjugate of T, idft(.) is the inverse discrete Fourier transform operator, real(.) extracts the real part of its operand, and μ_f and μ_t denote the means of the image patch and the template, respectively.
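As an illustration, Eq. (2.13) can be sketched directly in NumPy. The exhaustive scan below is the naive per-position search rather than the FFT-accelerated form used in practice, and the function names are illustrative.

```python
import numpy as np

def ncc(f, t, m, n):
    """Normalized cross correlation (Eq. 2.13) between template t and
    the patch of image f whose top-left corner is at (m, n)."""
    K, L = t.shape
    patch = f[m:m + K, n:n + L].astype(float)
    pd = patch - patch.mean()             # subtract patch mean (mu_f)
    td = t.astype(float) - t.mean()       # subtract template mean (mu_t)
    denom = np.sqrt((pd ** 2).sum() * (td ** 2).sum())
    return (pd * td).sum() / denom

def ncc_search(f, t):
    """Exhaustive search: return the location maximizing the NCC score."""
    K, L = t.shape
    best, pos = -2.0, None
    for m in range(f.shape[0] - K + 1):
        for n in range(f.shape[1] - L + 1):
            s = ncc(f, t, m, n)
            if s > best:
                best, pos = s, (m, n)
    return pos, best

rng = np.random.default_rng(0)
frame = rng.random((30, 30))
template = frame[10:18, 5:13].copy()      # target selected in the first frame
loc, score = ncc_search(frame, template)  # loc recovers (10, 5), score ~ 1
```

Because NCC is bounded in [-1, 1], the peak score can be compared against a fixed threshold to validate the match, which is exactly what SC cannot offer.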

SC does not have any bounding value; therefore, no threshold can be set to validate the match score and update the template. Moreover, it is sensitive to illumination and produces a peak matching value at the brightest spot in the image. PC computes correlation in the Fourier domain. It is insensitive to variations in image intensity because it ignores the Fourier magnitude and uses the phase component only. It has strong discriminatory power and produces a sharp peak, but it is not as robust to noise as SC [130]. Moreover, it assigns equal weight to all of its components, which seems inappropriate, as significant components should ideally be given more weight than others [128]. Due to these deficiencies, PC may yield false positives [44, 131, 132]. Different variants of PC have also been proposed [133-135], yet they are not as robust to variations in appearance, illumination, and contrast as NC and NCC. These two metrics have values in the ranges [0, 1] and [-1, 1], respectively. Therefore, it is easy to set a threshold for


template updating and occlusion handling. Updating the template is mandatory for tracking an object that changes its appearance. Ali [36] updated the template completely in every frame if the peak correlation value was higher than a threshold. This updating scheme suffers from a fast template-drift problem if the newly found template position is not the exact position. To handle this problem, Ahmed et al. [44] updated the template smoothly using a first-order IIR filter. Most of the time, NCC is used as a similarity measure in image registration [126, 128, 136, 137], but NC performs better than NCC when edge enhancement is performed as a preprocessing step of target tracking [44]. Aasgrizadeh et al. [138] integrated Region Mutual Information (RMI) with edge correlation tracking for more robust tracking of aerial objects. RMI provides information about cluttered and clear backgrounds as well as high luminance changes. Table 2.4 shows the comparison of the above-mentioned correlation metrics with respect to their discriminatory power and robustness to noise.
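The smooth update of Ahmed et al. [44] amounts to a first-order IIR (exponential) filter on the template. The sketch below is a minimal version; the learning rate `alpha` and the acceptance threshold `tau` are illustrative values, not the ones from the cited work.

```python
import numpy as np

def update_template(template, patch, score, alpha=0.1, tau=0.8):
    """First-order IIR template update: T <- alpha * patch + (1 - alpha) * T.
    The update is applied only when the match score exceeds tau, which
    also acts as a simple occlusion guard (no update on poor matches)."""
    if score < tau:
        return template                   # likely occlusion: keep old template
    return alpha * patch + (1 - alpha) * template

# The template drifts smoothly toward a new appearance over many frames,
# instead of being replaced wholesale as in the complete-update scheme.
template = np.zeros(16)
new_appearance = np.ones(16)
for _ in range(50):
    template = update_template(template, new_appearance, score=0.95)
```

A small `alpha` slows template drift when the localized position is slightly wrong, at the cost of slower adaptation to genuine appearance change.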

2.3.4 Motion Detection for Tracking

There are various methods for motion detection, including background subtraction, temporal differencing, background modeling, and optical flow.

Background Subtraction

The target or area of interest in a scene is referred to as the foreground, and anything else in the image is termed the background. Background subtraction or foreground detection may be used for two purposes. First, it may be used to initialize the tracking,

and second, it is used to detect the target of interest from frame to frame.

Table 2.4 Comparison of different correlation metrics.

| Sr. No. | Correlation type | Discriminatory power | Robustness to noise |
| 1 | Standard correlation | Poor | Poor |
| 2 | Phase correlation | Strong | Poor |
| 3 | Normalized correlation | Strong | Strong |
| 4 | Normalized cross correlation | Strong | Strong |

The simplest method for foreground detection is to subtract each frame from a fixed background model, in the case of a stationary background. The pixels corresponding to the background will yield very low values, and the pixels related to the foreground will create high values in the subtracted image. Thus, a threshold can be set to distinguish the foreground pixels from the background pixels. A connected-component algorithm is used to group the foreground pixels, and the target is searched only in the foreground regions. In this way, exhaustive search is avoided and the efficiency of the algorithm is improved. This straightforward method of background subtraction normally works in a structured environment; it fails in an unstructured or outdoor environment, where the illumination and background do not remain stationary.
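A minimal sketch of this fixed-background scheme, assuming an 8-bit grayscale frame; the threshold value and the bounding-box step standing in for full connected-component labeling are illustrative simplifications.

```python
import numpy as np

def foreground_mask(frame, background, thresh=25):
    """Threshold the absolute frame-background difference (fixed model)."""
    diff = np.abs(frame.astype(int) - background.astype(int))
    return diff > thresh

# Stationary background with one bright foreground object
background = np.zeros((20, 20), dtype=np.uint8)
frame = background.copy()
frame[5:9, 7:12] = 200                    # foreground object

mask = foreground_mask(frame, background)
ys, xs = np.nonzero(mask)
# Bounding box of foreground pixels restricts the subsequent target
# search to that region, avoiding an exhaustive scan of the frame.
bbox = (ys.min(), xs.min(), ys.max(), xs.max())
```

Grouping the mask into connected components (rather than one global bounding box) would separate multiple moving objects before the search step.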

A fixed background model does not work outdoors; therefore, an adaptive background model is used. Wren et al. [139] proposed a unimodal Gaussian model for each pixel using its mean and variance in the YUV color space. Their algorithm works well for handling small illumination changes, but it is not effective in the case of sudden illumination changes and dynamic backgrounds (e.g., flashing lights, swaying trees or bushes, moving fountains, and rotating fan blades). These issues are handled by the work of Stauffer et al. [101, 140, 141]. They adopt a multi-modal Gaussian representation for each pixel and update the model online to learn the changing background. Usually, 3 to 5 Gaussians are used to model each pixel distribution. If a match of the current pixel is found with any Gaussian distribution, it is considered background; otherwise, it is classified as a foreground pixel, and the background model is updated accordingly. This multi-modal Gaussian approach does not tackle the problems of drastic illumination change and moving shadows. KaewTraKulPong et al. [142] improved the learning rate of the Gaussian mixture model and introduced a shadow

detection method. Their algorithm compares the foreground pixel with the background model; if the difference between the chromatic and brightness components is within a certain threshold, the pixel is considered to be part of a shadow. A similar technique was presented by Horprasert et al. [143, 144]. Haritaoglu et al. [81] developed a real-time surveillance system that trains the background model using three features, i.e., the minimum pixel value (m), the maximum pixel value (n), and the maximum intensity difference between consecutive frames (d). A pixel is classified as foreground if its difference from m or n is greater than d; otherwise, it is taken as a background pixel. Oliver et al. [145] used principal component analysis and eigen-decomposition to build an eigen-background. The projected image of the current frame is subtracted from the eigen-background to detect foreground objects.


Temporal Differencing

Temporal differencing means subtracting the previous frame from the current frame to detect changes or moving objects in the scene. Lipton et al. [146] used temporal differencing between two consecutive frames for foreground detection. Their approach used multiple hypotheses for the classification of foreground regions as targets of interest; the classification metric employs perimeter and area to identify the targets in the difference image. In order to improve the detection of foreground regions, a three-frame temporal differencing scheme may also be used [147, 148]. Temporal differencing methods are sensitive to the threshold used to discriminate between foreground and background regions, as well as to illumination changes. Moreover, when the target stops moving, it cannot be detected as a foreground object.
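The three-frame scheme of [147, 148] can be sketched as the conjunction of two consecutive difference masks, which isolates the object at its middle-frame position. The frames and the threshold below are illustrative.

```python
import numpy as np

def three_frame_mask(f0, f1, f2, thresh=20):
    """Pixels that changed between f0->f1 AND f1->f2: this keeps the
    moving object at its position in the middle frame f1 and suppresses
    the 'ghost' left at the old position by plain two-frame differencing."""
    d1 = np.abs(f1.astype(int) - f0.astype(int)) > thresh
    d2 = np.abs(f2.astype(int) - f1.astype(int)) > thresh
    return d1 & d2

# A one-pixel object moving right by two columns per frame
f0 = np.zeros((10, 10), dtype=np.uint8); f0[4, 2] = 255
f1 = np.zeros((10, 10), dtype=np.uint8); f1[4, 4] = 255
f2 = np.zeros((10, 10), dtype=np.uint8); f2[4, 6] = 255

mask = three_frame_mask(f0, f1, f2)       # fires only at (4, 4), i.e., in f1
```

Note the sketch also exhibits the stated limitation: if the object stops (all three frames identical), the mask is empty.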

Optical Flow

Optical flow is the apparent motion pattern in an image of a scene due to relative

motion between objects of the scene and the camera. The calculation of optical flow

assumes brightness constancy between corresponding pixels across frames. There are various methods for calculating dense optical flow in an image, such as Lucas and Kanade [56], Horn and Schunck [149], Black and Anandan [150], and Szeliski and Coughlan [151]. Optical flow is used as a feature in segmentation and tracking applications based on the motion of objects. Shi and Tomasi [152] exploited optical flow to estimate the motion of a region in an image and developed their well-known KLT tracker. The tracker is sensitive to illumination changes and large inter-frame motion. Rangarajan and Shah [153] used optical flow to find the initial inter-frame correspondence between the first two frames for their proposed greedy search algorithm.

Papageorgiou et al. [154] used optical flow to reduce the search space of their SVM

based pedestrian and face detection algorithm. Cremers et al. [155] used optical flow as a feature in a contour-based tracking algorithm. Li et al. [156] used optical flow for silhouette tracking. Bertalmio et al. [157] and Mansouri [158] used optical flow for the minimization of contour energy.

2.4 Contemporary Tracking Approaches

In this section, we investigate recent approaches for VOT, which include: (1) tracking by detection, (2) sparse representation, (3) particle swarm optimization, and (4) integration of context information.


2.4.1 Tracking by Detection

This class of algorithms treats tracking as the detection of the target in consecutive image frames by training a binary classifier to discriminate the target from its background. The class is termed tracking-by-detection or tracking-by-repeated-recognition [3]. These methods have gained popularity in recent years due to their efficacy in performance and the simplicity of the classification task [2, 6, 18, 53, 159-161]. A detailed discussion of different classifiers can be found in [162, 163]. Normally, a classifier requires data for its training, but no prior knowledge is available about the target position in a tracking application. Therefore, training data are generated online during tracking and the classifier is updated accordingly. This is called adaptive tracking-by-detection.

Collins et al. [6] presented an approach for online selection of features to discriminate

the target from its background. The estimated position of the target in each frame is

considered as a positive example and its nearby locations are treated as negative

examples for updating the classifier. This step is called Generation and Labeling of

Samples [11]. During tracking, the classifier finds the target position by maximizing the

classification score in a local region, normally around the target position found in the

previous frame, using the sliding-window method. Figure 2.3 explains this tracking and updating process.

Figure 2.3 (Source [1]): Adaptive tracking-by-detection process, i.e., tracking the target and updating the classifier.

Avidan [18] introduced Support Vector Tracking (SVT), which

integrates Support Vector Machine (SVM) classifier with optical flow for vehicle

tracking. Grabner et al. [2] presented an online version of the AdaBoost approach for

real-time tracking. The tracking approaches [2, 6, 18] update their classifiers by

considering only a single positive example, consisting of the current position of the target, and many negative examples, i.e., samples around the current target position, as shown in Figure 2.4. Small inexactness in the target position results in poorly labeled training samples. This is called label jittering, which degrades the performance of classifiers and ultimately causes the drift problem. Therefore, most of the recent tracking-by-

detection approaches try to improve tracking performance by making the classifier

more robust to incorrectly labeled examples [3, 40, 50, 164-166]. Babenko et al. [3]

presented the Online Multiple Instance Learning Boosting (Online MILBoost) algorithm for robust tracking. Instead of assigning a label to each individual example, their algorithm combines instances into bags, and a label is assigned to each bag. A positive bag should contain at least one positive example; otherwise, a negative label is assigned to it, as shown in Figure 2.5. Their algorithm shows prominent results against the drift issue, but it fails to recapture the target if it gets out of the scene and returns. This is a problem with all adaptive appearance-model based algorithms: they start

updating themselves with a false object if the target is fully occluded or it gets out of the scene for a while.

Figure 2.4 Positive and negative samples for online AdaBoost [2]

Figure 2.5 Positive and negative bags for MIL classifier [3]

Grabner et al. [40] solved this issue by using a semi-supervised

appearance-model updating method. Their method combines the labeled data (prior knowledge), i.e., the target selected by the user in the first frame, with the current unlabeled data. In this way, the method becomes robust to the drift issue, but it shows less adaptability to appearance changes. Zeisl et al. [50] combined the strengths of semi-supervised and multiple instance learning into a single framework. Zhang et al. [45] improved the work of Babenko et al. [3] by assigning different weights to different instances: the closer the instance to the target, the higher the weight assigned to it.

Williams et al. [167] point out that the highest classification score does not necessarily belong to the target position, as there is no explicit relationship between classification confidence and the target's spatial position. Hare et al. [11] present a framework based on structured output prediction which explicitly incorporates the tracker's need for labeled training examples into the output space. Instead of learning a classifier, the framework focuses on estimating the target transformations using a structured output SVM. Thus, it avoids the intermediate step of generating and labeling samples. Table 2.5 summarizes the representative work on tracking-by-detection, mentioning the discriminative technique used in each work.
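The generation-and-labeling step described in this section can be sketched as follows. The radii, the step size, and the function name are illustrative choices, not values from the cited papers.

```python
def generate_samples(center, r_pos=4, r_neg_in=8, r_neg_out=16, step=2):
    """Label sample positions around the estimated target center:
    positions within r_pos are positive examples; positions in the
    annulus [r_neg_in, r_neg_out] are negative examples."""
    cx, cy = center
    pos, neg = [], []
    for dx in range(-r_neg_out, r_neg_out + 1, step):
        for dy in range(-r_neg_out, r_neg_out + 1, step):
            r = (dx * dx + dy * dy) ** 0.5
            if r <= r_pos:
                pos.append((cx + dx, cy + dy))
            elif r_neg_in <= r <= r_neg_out:
                neg.append((cx + dx, cy + dy))
    return pos, neg

pos, neg = generate_samples((50, 50))
```

Under the single-positive scheme of [2, 6, 18] only the center would be kept as a positive example, whereas MILBoost [3] would instead place all of `pos` into one positive bag; either way, a small localization error contaminates these labels, which is exactly the label-jittering problem discussed above.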

Table 2.5 Representative work of tracking-by-detection technique.

| Sr. No. | Representative work | Discriminative technique |
| 1 | Collins et al. [6] | Selection of features by ranking |
| 2 | Hare et al. [11] | Structured SVM learning |
| 3 | Avidan [18] | SVM learning |
| 4 | Grabner et al. [2] | Boosting by selection of features using feature ranking |
| 5 | Babenko et al. [3] | Boosting by multiple instance learning |
| 6 | Grabner et al. [40] | Semi-supervised boosting |
| 7 | Zhang et al. [45] | Boosting by weighted multiple instance learning |
| 8 | Zeisl et al. [50] | Semi-supervised multiple instance learning |

2.4.2 Particle Swarm Optimization

Particle swarm optimization (PSO), inspired by birds searching for food, was first introduced by Kennedy et al. [168, 169] in 1995. Since then, its use in different applications has been increasing, and it has drawn the attention of researchers in

different fields [170-172]. It is a stochastic process exploiting the phenomenon of swarm intelligence; it works on the collective wisdom contributed by each of its particles. Each particle in PSO updates its position by considering its own best position as well as the best position of its neighborhood, until all the particles find a common converging position or the maximum number of iterations is reached.

The size of the neighborhood may vary from one particle to the entire swarm around the current particle. An objective function is required to calculate the fitness of each particle. The position of a particle with the highest fitness value is considered its best position, and the best among these best positions is the global or swarm best position. PSO has a very simple formulation, consisting of the velocity and position update equations described by Eq. (2.14) and Eq. (2.15):

v_d^i(n+1) = w v_d^i(n) + c_1 r_1 (p_d^i - x_d^i(n)) + c_2 r_2 (g_d - x_d^i(n))    (2.14)

x_d^i(n+1) = x_d^i(n) + v_d^i(n)    (2.15)

where v_d^i(n) and x_d^i(n) represent the velocity and position of the i-th particle in dimension d at iteration n, respectively, p_d^i is the particle's personal best position, g_d is the global best position of the swarm, w is the inertia weight, c_1 and c_2 are constants, and r_1 and r_2 are random values in the range [0, 1]. Different variants of PSO and their applications can be found in [170, 173, 174].

Table 2.6 Representative work using different variants of PSO in VOT.

| Sr. No. | Representative work | PSO variant |
| 1 | Zhang et al. [7] | Sequential PSO |
| 2 | Zhang et al. [16] | Species PSO |
| 3 | Akbari et al. [23] | Standard PSO |
| 4 | Kwolek et al. [31] | Standard PSO |
| 5 | Anton et al. [38] | Predator-Prey PSO |
| 6 | Zheng et al. [43] | Standard PSO |
| 7 | A. Tawab et al. [48] | Standard PSO |
| 8 | Borra et al. [54] | PSO-FCM |

In visual tracking

applications, PSO is used to search for the best candidate position of the target in the

current frame. Zhang et al. [7] introduced the temporal sequence information of the target into PSO and named it Sequential PSO. They presented a particle filter based tracking algorithm with a hierarchical importance sampling process guided by sequential PSO; thus, their approach helps to cope with the classic sample impoverishment problem of the particle filter. Zhang et al. [16] proposed species-based PSO for tracking multiple objects. Each species tracks an individual object; thus, different trackers run under a single framework. Inter-object occlusion is handled by species competition and repulsion. The number of objects, and hence species, is initialized by the user at the very beginning of the tracking process.

R. Akbari et al. [23] combined PSO and KF to track multiple objects in cluttered environments. They used a non-overlapping-fragment based representation of objects, in which each fragment is represented by a particle. The particles of PSO are guided by KF in a hybrid framework using region as well as object information. Kwolek et al. [31] proposed a multi-object tracking algorithm which uses PSO to improve the target position found by discriminative appearance models. The objective function is based

on a fragment-based representation of targets and their covariance matrices. A. Canalis et al. [38] used PSO to track targets in a predator-prey style. Each particle directly interacts with a pixel, and tracking is performed through the interaction of particles with their environment. Zheng et al. [43] represented the target in a multi-dimensional feature space and employed a PSO algorithm to expedite the search process. The Bhattacharya coefficient is used as the fitness function for PSO in their algorithm to track faces and vehicles. Abdel Tawab et al. [48] proposed PSO based fast gray-level object tracking.

They employed a combination of SIMilarity (SIM) and Bhattacharya coefficients as fitness functions to evaluate the scores of the PSO particles. Borra et al. [54] proposed a PSO-Fuzzy C-Means (FCM) based tracking algorithm: PSO-FCM is used to segment the objects in the scene, and a pattern-matching approach is used to track the target. Table 2.6 summarizes the representative work on PSO in VOT.
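Eqs. (2.14) and (2.15) lead to a compact implementation. The sketch below minimizes a generic objective; in a tracker, f would be the negative of a similarity score (e.g., a Bhattacharya-coefficient fitness, as in [43, 48]) evaluated at candidate target positions. The parameter values are illustrative.

```python
import numpy as np

def pso(f, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard PSO minimizing f over R^dim (Eqs. 2.14-2.15)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, (n_particles, dim))    # particle positions
    v = np.zeros((n_particles, dim))                  # particle velocities
    pbest = x.copy()                                  # personal best positions p
    pbest_val = np.array([f(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()              # global best position g
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)  # Eq. (2.14)
        x = x + v                                              # Eq. (2.15)
        vals = np.array([f(p) for p in x])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()

# Toy objective whose minimum (the "true target position") is at (1, 2)
target = np.array([1.0, 2.0])
best, best_val = pso(lambda p: float(((p - target) ** 2).sum()), dim=2)
```

With the fitness defined over candidate windows rather than a toy quadratic, the returned global best plays the role of the new target position in the current frame.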

2.4.3 Sparse Representation

Compressive or sparse representation [175, 176] expresses a signal as a linear combination of a small number of basis vectors. This representation is becoming popular in various pattern recognition and image processing applications [177-180].


X. Mei et al. [19, 181, 182] used sparse representation for object tracking. The algorithm is capable of coping with the occlusion problem using trivial templates and an l1-minimization approach for sparse representation. Trivial templates are used to model the target as well as the background, which makes the reconstruction error small for both regions. The candidate region with the minimum reconstruction error is considered the target. The algorithm is computationally expensive due to the l1-minimization step. Liu et al. [183] improved tracking efficiency and robustness by exploiting sparseness and using a set of discriminative features. The algorithm uses a fixed number of features; therefore, it is not effective in complex or dynamic environments.

Liu et al. [184] used mean shift and a histogram-based local sparse representation for the appearance model. However, the histogram representation is unable to distinguish between the target and the background due to its inherent loss of spatial information. Zhong et al. [67] made object tracking robust to occlusion using a hybrid approach of a sparsity-based discriminative classifier (SDC) and a sparsity-based

generative model (SGM). SDC assigns a higher confidence to foreground objects than to background objects. SGM proposes a new method to calculate histograms which also preserves the spatial position of each patch. Jia et al. [64] proposed a structured sparse representation of the appearance model for target tracking. The algorithm uses an alignment-pooling method to capture partial as well as spatial information in order to tackle the occlusion problem. Moreover, the algorithm proposes a novel template-updating method based on incremental subspace and sparse representations.

Present tracking algorithms update the appearance model using the current image frame; therefore, these methods are data dependent. Zhang et al. [62] exploit multi-space features for appearance modeling based on data-independent bases. The algorithm uses random projections to project the feature space of objects in the image, and sparse representation is employed to extract the features. A detailed review and experimental comparison of sparse coding based visual tracking can be found in the paper of Zhang et al. [185].
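A candidate's reconstruction error under an l1-regularized sparse code, as used by the l1 trackers above, can be sketched with a basic ISTA solver. The dictionary, regularization weight, and iteration count are illustrative, and real trackers additionally use trivial templates and faster solvers.

```python
import numpy as np

def ista(D, y, lam=0.1, iters=200):
    """Iterative shrinkage-thresholding for min_x 0.5||Dx - y||^2 + lam||x||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(iters):
        g = x - D.T @ (D @ x - y) / L      # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
    return x

def reconstruction_error(D, y, lam=0.1):
    """Score a candidate y by its sparse reconstruction error; the
    candidate with the smallest error is taken as the target."""
    x = ista(D, y, lam)
    return float(((D @ x - y) ** 2).sum())

# Toy dictionary of template vectors (columns): a candidate close to the
# span of the templates reconstructs with low error, clutter does not.
rng = np.random.default_rng(1)
D = rng.random((16, 5))
D /= np.linalg.norm(D, axis=0)             # unit-norm template columns
target_like = D @ np.array([0.0, 1.0, 0.0, 0.0, 0.0])   # matches a template
clutter = rng.random(16)
err_target = reconstruction_error(D, target_like)
err_clutter = reconstruction_error(D, clutter)
```

The per-candidate solve is what makes these trackers computationally expensive, motivating the efficiency-oriented variants discussed above.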

2.4.4 Integration of Context Information

Integration of context with the target of interest for robust object tracking has gained significant importance in recent years. Various psychophysics studies have emphasized the role of context in image understanding for the human perception system [186]. A detailed discussion of the role of context in object detection can be found in


[187]. Yang et al. [8] proposed using a number of objects near the target as spatial context to enhance the appearance model. These objects are automatically extracted from the video at run-time and are named auxiliary objects. Auxiliary objects are chosen, at least for a short time interval, according to the following criteria: (1) straightforward to track, (2) persistent co-occurrence with the target, and (3) consistent motion correlation with the target. Li et al. [13] modelled contextual relationships with a dynamic Markov random field for simultaneously recognizing, localizing, and tracking multiple objects of different categories in meeting-room videos. The spatio-temporal relationship is used to obtain information about the object category and its state. Nguyen et al. [22] used spatio-temporal context for multi-target tracking: the spatial context includes nearby objects, and the temporal context contains all previous target models based on Probabilistic Principal Components Analysis (PPCA). Wen et al. [33] also proposed a spatio-temporal context relationship for robust object tracking. Grabner et al. [37] used the Hough transform to integrate temporal context (supporters) and distinguish between strongly and weakly coupled motions. Their algorithm works well in the case of full occlusion and when the target changes its appearance heavily and rapidly. Table 2.7 summarizes the representative work exploiting context information for VOT.

2.5 Evaluation Methods for VOT Algorithms and Benchmark Resources

VOT algorithms are evaluated qualitatively as well as quantitatively. For qualitative

comparison, sample image frames are shown and visually examined. The visually better results are considered to be those whose tracked rectangle is closer to the target of

Table 2.7 Representative work of exploiting context information for VOT

Sr. No. Representative work Contextual information

1. Yang et al. [8] Spatial position

2. Li et al. [13] Spatio-temporal relationship

3. Nguyen et al. [22] Spatio-temporal relationship

4. Wen et al. [33] Spatio-temporal relationship

5. Grabner et al. [37] Temporal context


interest, as shown in Figure 2.6. Qualitative analysis does not provide a fair comparison between different algorithms. Therefore, quantitative measures are calculated to obtain a better understanding of the robustness of the algorithms. For this, two measures are employed. The first is the mean distance from the center location; it provides the error between the center location of the tracking rectangle and its ground truth value. The overall performance of the algorithm is summarized by computing the mean of the center location errors over all the frames in a video. The problem with this method is that if a particular method successfully tracks a target in most of the video frames but loses the target in a few frames by a large distance, its performance will appear poor in comparison with a method which mostly does not track the target but clings to the background near the target. Therefore, this method is not a true representative of the performance of a particular method. One modification to this method was made by Babenko et al. [3] and Henriques et al. [188]: they calculate the percentage of frames in which the distance between the tracked location and the ground truth location is less than a fixed threshold (e.g., 20 pixels). The other quantitative measure is termed

Figure 2.6 A few tracked frames (Frames 360-1517) of the Liquor video sequence. The yellow rectangle shows the tracked window; the closer it is to the target, the better the result.


as the Pascal score. It finds the overlap between the tracked target region and its ground truth region, as described by Eq. (2.16):

p = \frac{\operatorname{area}(R_t \cap R_g)}{\operatorname{area}(R_t \cup R_g)} \quad (2.16)

Table 2.8 List of a few online publicly available tracking resources.

Sr. No. | Name | Dataset | Ground truth | Source code | URL
1. | Fragtrack [4] | √ | √ | √ | www.cs.technion.ac.il/~amita/fragtrack/fragtrack.htm
2. | Incremental visual tracker [14] | √ | √ | √ | www.cs.utoronto.ca/~dross/ivt/
3. | ℓ1 tracker [19] | ⅹ | ⅹ | √ | www.ist.temple.edu/~hbling/code data.htm
4. | Kernel based tracker [17] | ⅹ | ⅹ | √ | code.google.com/p/detect/
5. | Boosting tracker | √ | ⅹ | √ | www.vision.ee.ethz.ch/boostingTrackers/
6. | MIL tracker [3] | √ | √ | √ | vision.ucsd.edu/~bbabenko/project_miltrack.shtml
7. | Visual tracking decomposition [42] | √ | √ | √ | cv.snu.ac.kr/research/~vtd/
8. | Structural SVM tracker [11] | ⅹ | ⅹ | √ | www.samhare.net/research/struck
9. | PROST tracker [53] | √ | √ | √ | gpu4vision.icg.tugraz.at/index.php?content=subsites/prost/prost.php
10. | KLT tracker [56] | ⅹ | ⅹ | √ | www.ces.clemson.edu/~stb/klt/
11. | Condensation tracker [59] | √ | ⅹ | √ | www.robots.ox.ac.uk/~misard/condensation.html
12. | Caviar sequences | √ | √ | ⅹ | homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/
13. | PETS sequences | √ | √ | ⅹ | www.hitech-projects.com/euprojects/cantata/datasets cantata/dataset.html
14. | Compressive tracking [62] | √ | √ | √ | www4.comp.polyu.edu.hk/~cslzhang/CT/CT.htm
15. | Structural local sparse tracker [64] | √ | √ | √ | ice.dlut.edu.cn/lu/Project/cvpr12 jia project/cvpr12 jia project.htm
16. | Sparsity-based collaborative tracker [67] | √ | √ | √ | ice.dlut.edu.cn/lu/Project/cvpr12 scm/cvpr12 scm.htm


where Rt and Rg are the tracked target region and its ground truth region, respectively, and ∩ and ∪ denote intersection and union, respectively. The Pascal score takes a value in the closed interval from 0 to 1: if there is no overlapping region, its value is 0, and it attains the value 1 in the case of full overlap. The target is considered to be successfully tracked in a frame if its Pascal score is greater than 0.5 (i.e., at least fifty percent overlap). In order to have a fair comparison between

different tracking algorithms, two things are required: test videos with annotations, and implementations of the algorithms. Wu et al. [189] have organized a dataset comprising fifty videos with their ground truth values and a code library containing implementations of 29 tracking algorithms. They provide a performance evaluation and comparison of these algorithms over different attributes, e.g., scale changes, illumination changes, occlusion handling, and overall tracking performance. In order to make this chapter self-contained, a list of a few publicly available VOT resources is shown in Table 2.8.
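Both quantitative measures above (the center location error with its thresholded precision variant, and the Pascal overlap score of Eq. (2.16)) can be sketched as follows. This is a minimal sketch: the rectangle convention (x, y, w, h) and the function names are illustrative assumptions, not taken from any of the cited toolkits.

```python
import numpy as np

def center_location_error(track_centers, gt_centers):
    """Mean Euclidean distance between tracked and ground-truth centers."""
    t = np.asarray(track_centers, dtype=float)
    g = np.asarray(gt_centers, dtype=float)
    return float(np.linalg.norm(t - g, axis=1).mean())

def precision_at(track_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose center error is below `threshold` pixels,
    the measure used by Babenko et al. [3] and Henriques et al. [188]."""
    t = np.asarray(track_centers, dtype=float)
    g = np.asarray(gt_centers, dtype=float)
    d = np.linalg.norm(t - g, axis=1)
    return float((d < threshold).mean())

def pascal_score(rt, rg):
    """Eq. (2.16): p = area(Rt ∩ Rg) / area(Rt ∪ Rg).
    Rectangles are (x, y, w, h) tuples (an assumed convention)."""
    xt, yt, wt, ht = rt
    xg, yg, wg, hg = rg
    iw = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))
    ih = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))
    inter = iw * ih
    union = wt * ht + wg * hg - inter
    return inter / union if union > 0 else 0.0
```

A score of 0.5 or more from `pascal_score` then marks a frame as successfully tracked.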

2.6 Chapter Summary

In this chapter, different object tracking algorithms have been investigated. The

proposed taxonomy categorizes the VOT algorithms into classical and contemporary

approaches. Mean shift, Kalman filter, motion detection, and template matching based algorithms are presented as classical approaches for visual tracking, whereas tracking-by-detection, swarm intelligence, sparse representation, and integration of context have made their place among contemporary tracking algorithms. Representative

work in classical and contemporary approaches has been investigated. It is clear from the literature discussed in this chapter that no universal tracker exists which works equally well in all kinds of situations and environments. Most tracking algorithms work in a structured environment or track a specific type of target. The reason for this is the lack of accurate mathematical models for complex target motion and appearance change, which may be a future research area for the community working in computer vision. Selection of distinct features with high discriminatory power in the presence of cluttered background, motion blur, and occlusion is another avenue to be explored by researchers. Online updating of the classifier and of the template representing the target is also required to cope with the varying appearance of a target. Current updating methods are prone to error due to the inclusion of background pixels in the model or classifier and suffer from the


template drift problem. An accurate updating method is an active research field for experts dealing with tracking problems. Context-awareness based tracking approaches have generally shown better results in recent years. Therefore, integration and automatic extraction of contextual objects (supporters or auxiliary objects) may attract the attention of researchers in the future.


3 Proposed Template Updating Method

Visual object tracking can be considered as a process consisting of the representation of the target, called the template, and its localization in consecutive image frames. Template updating is required to handle the changing appearance of the target. During updating, a template allows some background pixels to enter its model due to inaccuracy in calculating the target position. As time passes, these errors accumulate, the template starts sliding off the target, and finally it gets stuck in the background. This problem is called template drift, and it is one of the most challenging problems faced by tracking algorithms. Slow template updating will slow down the drift problem, but it will fail to track a target changing its appearance rapidly; this is called the stagnation to the old appearance problem. On the other hand, frequent updating is more prone to the drift problem. Thus, stability of the tracking algorithm and, at the same time, adaptability of its template require a trade-off. This is also known as the stability-plasticity dilemma [53]. The existing template updating methods do not take into account the actual appearance changes of the targets; therefore, they are not very effective against the template drift and stagnation to old appearance problems. This chapter proposes a new template updating method for correlation based tracking algorithms which updates the template according to the rate of change in the appearance of the target and finds a good trade-off in the stability-plasticity dilemma. Moreover, the method is capable of reverting the template to some previously better representation if more recent updating is incorrect. Thus, the proposed algorithm helps to overcome the problem of template drift, especially during occlusion and complex (e.g., out-of-plane) motion of the target. Experimental results and comparison with other algorithms on different publicly available challenging videos prove the efficacy of the algorithm.

3.1 Correlation based Template Updating Methods

Target detection and tracking algorithms based on correlation have been common in the computer vision community since 1979 [56, 103, 190-192]. In such algorithms, the target is represented by an appearance model, named the template, which is


matched with the upcoming image or a part of it (called the search window) to find the position of the best possible candidate for the target. The target may be represented by different features such as intensity [36, 193], color [17], texture [194], etc. Details of feature selection can be found in Yang et al. [30]. Recently, Edge Enhanced (E2) template representation has proved its efficacy for robust object tracking [44, 138, 191, 195]. In E2 tracking, the target may be selected by the user, or it may be detected by some target detection method, to initialize the tracking process. The success of correlation tracking is mainly based on two factors: one is the correlation metric, which should have as little error as possible in calculating the target position, and the other is the template updating method, which finds a trade-off between the stagnation to old appearance and template drift problems. In this chapter, we will use Normalized Correlation (NC) for template matching, as it is better than other correlation metrics (e.g., phase correlation, normalized correlation coefficient, Bhattacharyya coefficient) when the template is edge-enhanced [44, 191]. A new method is proposed which updates the template according to the rate of change in the appearance of the target, i.e., the smaller the changes in the appearance of the target, the slower the template updating, and the larger the changes, the greater the update rate. Moreover, if the template is contaminated by background pixels or noise, it is restored to a previously found, rather good appearance. Thus, the template is updated properly, which in turn provides support to tackle complex (e.g., out-of-plane) motion of the target and slowly occurring long-term occlusion.
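The normalized correlation matching used throughout this chapter can be sketched as follows. This is a naive brute-force sketch for clarity (practical implementations compute the same surface far faster, e.g., in the frequency domain); the function name is an illustrative assumption.

```python
import numpy as np

def normalized_correlation_map(search, template):
    """Slide the template over the search window and return the NC surface.
    NC at offset (u, v) is sum(T * I_patch) / sqrt(sum(T^2) * sum(I_patch^2));
    unlike NCC, the means are not subtracted."""
    sh, sw = search.shape
    th, tw = template.shape
    t = template.astype(float)
    t_energy = np.sqrt((t * t).sum())
    out = np.zeros((sh - th + 1, sw - tw + 1))
    for v in range(out.shape[0]):
        for u in range(out.shape[1]):
            patch = search[v:v + th, u:u + tw].astype(float)
            denom = t_energy * np.sqrt((patch * patch).sum())
            out[v, u] = (t * patch).sum() / denom if denom > 0 else 0.0
    return out
```

The location of the surface maximum gives the best-match position, and the maximum itself is the peak correlation value cp used by the updating rules below.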

3.1.1 Traditional Template Updating Methods

In this section, we describe three traditional template updating methods.

Naive Template Updating Method

In this method, the template is updated on every frame, or after a number of frames, provided the peak correlation value is greater than a certain threshold, as follows:

t_{n+1} = \begin{cases} b_n & \text{if } c_p > \tau \\ t_n & \text{otherwise} \end{cases} \quad (3.1)

where bn is the best matched region in the current image, tn and tn+1 represent the current and the updated templates, respectively, cp is the peak correlation value, and τ is a fixed threshold. This scheme assumes bn to be the true target (which is


not the case in reality) and completely replaces the current template. Furthermore, it is highly prone to the template drift problem.

α-Template Updating Method

This method does not replace the current template with the best match region at once; rather, it introduces a parameter α, 0 < α < 1, to smoothly update the template, as follows:

t_{n+1} = \begin{cases} t_n + \alpha (b_n - t_n) & \text{if } c_p > \tau \\ t_n & \text{otherwise} \end{cases} \quad (3.2)

If α is assigned a small value (e.g., 0.02) [196], it mitigates the template drift problem, but it does not cater for rapid changes in the appearance of the target and remains stagnant in the target's old state. In order to address this issue, the idea of using α = cp was presented [190]. However, during normal tracking, the value of cp is greater than 0.9; thus, it behaves in the same way as the naive method does.

β-Template Updating Method

This method was proposed by Ahmed et al. [44] and has the same mathematical formulation as the α-template updating method (see Eq. (3.3)). The only difference is that α is replaced by β, where β = 0.15 c_p. It smoothly updates the template, but it does not work in the case of a fast maneuvering target.

t_{n+1} = \begin{cases} t_n + \beta (b_n - t_n) & \text{if } c_p > \tau \\ t_n & \text{otherwise} \end{cases} \quad (3.3)
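The three traditional updating rules of Eqs. (3.1)-(3.3) can be sketched as follows. Templates are NumPy arrays, and the default τ = 0.70 follows the parameter settings reported later in Section 3.3.1; the function names are illustrative.

```python
import numpy as np

def naive_update(t_n, b_n, c_p, tau=0.70):
    """Eq. (3.1): replace the template by the best match when c_p > tau."""
    return b_n.copy() if c_p > tau else t_n

def alpha_update(t_n, b_n, c_p, tau=0.70, alpha=0.02):
    """Eq. (3.2): blend toward the best match with a fixed small alpha."""
    return t_n + alpha * (b_n - t_n) if c_p > tau else t_n

def beta_update(t_n, b_n, c_p, tau=0.70):
    """Eq. (3.3): like the alpha method, but with beta = 0.15 * c_p."""
    beta = 0.15 * c_p
    return t_n + beta * (b_n - t_n) if c_p > tau else t_n
```

The contrast is visible directly: the naive rule jumps to the best match, while the α and β rules move toward it by a small fraction per frame.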

3.2 Proposed Template Updating Method

A good template updating scheme should handle the problems of template drift as well as stagnation to old appearance. For this, the updating scheme should be such that (1) it incorporates the maximum target changes, i.e., the updating should be dynamic, depending on whether the target is changing its appearance rapidly or slowly, (2) the template should contain as little background as possible, and (3) if the template is poorly updated with some background or noisy pixels, the updating scheme should be able to restore the template to a better representation. In the proposed method, the first template (which is selected by the user) is considered


the most trusted one and is kept in a buffer throughout the tracking session. Let it be denoted by t1. The last updated template is assumed to contain the maximum change in the target's appearance and is represented by tn, where the subscript denotes the frame number in which the template will be used to find the target. It is possible that tn has been corrupted due to occlusion or clutter; therefore, the second last template, denoted tn-1, is also kept in memory. Both templates, tn and tn-1, are correlated with the search window, and their peak correlation values are represented as cp(n) and cp(n-1), respectively. If cp(n) ≥ cp(n-1), the last updated template is considered to be the correct one; otherwise, tn is replaced by tn-1. The next step is template updating. For this, t1 is correlated with the search window, and its peak correlation value is represented by cp(1). If t1 fails to achieve at least a 50% match in the search window, it is assumed that the target is facing a slowly occurring occlusion which has corrupted both tn and tn-1. This assumption holds because the datasets used have targets which do not change their appearance very much. Therefore, we start updating tn partly by t1. Equations (3.4) and (3.5) describe the process,

a_n = \begin{cases} t_{n-1} & \text{if } c_{p(n)} < c_{p(n-1)} \\ t_n & \text{otherwise} \end{cases} \quad (3.4)

d_n = \begin{cases} \omega a_n + (1 - \omega) t_1 & \text{if } c_{p(1)} < 0.5 \\ a_n & \text{otherwise} \end{cases} \quad (3.5)

where 0 < ω ≤ 1. The subsequent process of template updating is mathematically represented by Equations (3.6) - (3.10).

t_{n+1} = \begin{cases} d_n + \gamma (b_n - d_n) & \text{if } c_p \ge \tau \\ d_n & \text{if } (c_p < \tau) \text{ and } (f \le \lambda) \\ \sigma d_n + (1 - \sigma) t_1 & \text{if } (c_p < \tau) \text{ and } (f > \lambda) \end{cases} \quad (3.6)

f = \begin{cases} 0 & \text{if } c_p \ge \tau \\ f + 1 & \text{otherwise} \end{cases} \quad (3.7)

\gamma = \delta \, \Delta c_{ref} + (1 - \delta) \, \Delta c \quad (3.8)

\Delta c_{ref} = 1 - c_p \quad (3.9)

\Delta c = | c_p^{\,n} - c_p^{\,n-1} | \quad (3.10)

where the superscript n denotes the frame number, 0 ≤ γ ≤ 1, 0 ≤ σ ≤ 1, 0 ≤ δ ≤ 1, λ > 0, and f is a counter whose value is increased by 1 if the peak correlation value, cp, is less than the threshold τ; otherwise its value is set to zero. These equations are explained as follows.

3.2.1 Case 1 (when cp ≥ τ)

In this case, the template is updated as a weighted average of the current template and the best match found in the image. The weight, γ, is calculated dynamically, as described by Eq. (3.8). It is made a function of the difference of peak correlations, ∆c, in the two

Algorithm 3.1 Proposed template updating method

Input: current template tn, previous template tn-1, initial template t1, search window, previous peak correlation value cp
Output: updated template

1. Initialize σ, δ, λ, ω, and τ.
2. Correlate tn, tn-1, and t1 with the search window and calculate cp(n), cp(n-1), and cp(1), respectively.
3. if cp(n) < cp(n-1)
4.     tn ← tn-1
5. end if
6. if cp(1) < 0.50
7.     tn ← ω tn + (1 - ω) t1
8. end if
9. oldcp ← cp
10. Correlate tn with the search window and calculate the current peak correlation value cp and the best match target candidate bn.
11. if cp ≥ τ
12.     f ← 0
13.     ∆cref ← 1 - cp
14.     ∆c ← |cp - oldcp|
15.     γ ← δ ∆cref + (1 - δ) ∆c
16.     tn ← tn + γ (bn - tn)
17. else
18.     f ← f + 1
19.     if f > λ
20.         tn ← σ tn + (1 - σ) t1
21.     end if
22. end if


latest frames, and of the difference of the peak correlation from its upper limit (which is 1), ∆cref. For an object changing its appearance heavily and rapidly, ∆c will be larger; otherwise, it will have a smaller value. Thus, the template is updated accordingly. Ideally, the updated template should have a 100%, or at least a near 100%, match in the next image. This is achieved by the ∆cref term in Eq. (3.8). Furthermore, ∆cref accelerates the updating process, which is normally slow due to the very small value of ∆c in consecutive image frames.

3.2.2 Case 2 (when cp < τ and f ≤ λ)

In this case, we do not update the template, considering that the target has been occluded. If this case holds for a certain number of frames, λ, it may be due to the following reasons: (1) the template is poorly updated and is therefore consistently failing to find a good match in the frames, or (2) the target moved so fast that it went outside the search window.

3.2.3 Case 3 (when cp < τ and f > λ)

In this case, the template is smoothly updated with the most trusted one, i.e., t1, to solve the first issue mentioned in Section 3.2.2. In order to handle the second issue, the search area of the template is iteratively increased. The pseudo code of the proposed method is provided in Algorithm 3.1.
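One pass of Algorithm 3.1 can be sketched as a single update routine. This is a minimal sketch: the function signature, and the assumption that all peak correlation values have already been computed by the correlation step, are illustrative simplifications of how the thesis couples this step with its tracker.

```python
import numpy as np

def proposed_update(t_n, t_prev, t_1, cp_n, cp_prev, cp_1,
                    b_n, c_p, old_cp, f,
                    tau=0.70, delta=0.3, lam=3, omega=0.25, sigma=0.035):
    """One pass of Algorithm 3.1. Templates are NumPy arrays; cp_n, cp_prev,
    cp_1 are the peak correlations of t_n, t_{n-1}, t_1 with the search
    window, c_p / old_cp the current and previous peak values, and f the
    occlusion counter. Returns (updated template, updated counter f)."""
    # Revert to the previous template if it matches the search window better
    if cp_n < cp_prev:
        t_n = t_prev
    # If even the initial template t_1 matches poorly, blend it back in (Eq. 3.5)
    if cp_1 < 0.50:
        t_n = omega * t_n + (1 - omega) * t_1
    # The three cases of Eq. (3.6)
    if c_p >= tau:                                 # Case 1: dynamic update
        f = 0
        dc_ref = 1.0 - c_p                         # Eq. (3.9)
        dc = abs(c_p - old_cp)                     # Eq. (3.10)
        gamma = delta * dc_ref + (1 - delta) * dc  # Eq. (3.8)
        t_n = t_n + gamma * (b_n - t_n)
    else:
        f += 1                                     # Case 2: suspected occlusion
        if f > lam:                                # Case 3: restore toward t_1
            t_n = sigma * t_n + (1 - sigma) * t_1
    return t_n, f
```

The defaults mirror the empirically determined parameters reported in Section 3.3.1.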

3.3 Results and Discussion

The qualitative as well as quantitative results of the proposed template updating method are shown on different challenging test videos, namely Girl, Woman, and Faceocc. The videos are publicly available and can be downloaded from [197, 198]. The Girl video has 502 frames and mainly contains the challenges of high appearance changes and out-of-plane rotations. The Woman video contains 552 frames and provides the challenges of high appearance changes as well as heavy and long-term occlusions. The Faceocc video comprises 887 frames and challenges the tracking algorithms with slowly occurring heavy occlusions. Table 3.1 summarizes the description of these videos. An edge-enhanced normalized correlation based method [44, 191] has been employed for object tracking.

Table 3.1 Description of test videos

Sequence | # of frames | Challenges involved
Woman | 552 | Occlusions, appearance changes, pedestrian motion
Faceocc | 887 | Slowly occurring long-term occlusions, high appearance changes
Girl | 502 | 360° out-of-plane rotation, appearance change, occlusion

3.3.1 Qualitative Analysis

Figure 3.1, Figure 3.2, and Figure 3.3 show a few frames of the Girl, Woman, and Faceocc videos, respectively. The first three rows in each figure are the results of the naive, α, and β methods, respectively, and the fourth row represents the result of the proposed method. The current template is shown at the top-right corner of each frame. The yellow rectangle in the figures shows the position of the best match of the template in the image. A white rectangle expresses that the tracker has lost the target and is in prediction mode. The empirically determined parameter settings are as follows: α

Figure 3.1 Comparison of different updating schemes (i.e., the naive, α, and β methods, shown in the first three rows, respectively) with the proposed method (fourth row) for the Girl video (Frames 27, 101, 133, 211, and 267). The video involves two out-of-plane rotations of the target (see Frames 101 and 211). The proposed method updates the template better than any of these methods and minimizes the template drift.


= cp (as suggested by Wong [190]), β = 0.15 cp (as proposed by Ahmed et al. [44]), τ = 0.70, δ = 0.3, λ = 3, ω = 0.25, and σ = 0.035. These parameter values are kept the same for all the videos.

In Figure 3.1, there are two out-of-plane rotations of the target (shown by Frames 101 and 211). The naive and α methods show almost similar behavior (as expected), and their templates start drifting at Frame 101. The β-method does not let the template drift at Frame 101, due to its relatively slow adaptation rate, but it fails at Frame 211. In comparison, the proposed algorithm updates the template according to its rate of change of appearance and keeps a lock on the target accurately, without any drift.

In Figure 3.2, a large part of the target (a woman walking on the footpath) gets occluded when she passes behind the cars parked at the roadside (see Frames 119, 213, 291, and 380). Moreover, there are appearance changes, clutter, and

Figure 3.2 Comparison of different updating schemes (i.e., the naive, α, and β methods, shown in the first three rows, respectively) with the proposed method (fourth row) for the Woman video (Frames 1, 119, 213, 291, and 380), which contains occlusions, appearance change of the target, clutter, and illumination change in the scene. It is clear that the proposed method works better than the methods in comparison.


illumination variations in the scene (e.g., Frames 119 and 213). Both the naive and α updating methods show almost similar results and completely lose the target due to template drift before Frame 119. The β-method performs better, but it also lets the template drift, at Frame 291. The proposed method updates the template better than the other methods and does not allow it to slide off the target throughout the video.

Figure 3.3 shows the results of the proposed algorithm on a few frames of the Faceocc video. The video involves the woman's face being slowly occluded by a book (e.g., Frames 189, 302, 412, and 466). The naive, α, and β methods for template updating start drifting off at Frame 466. The proposed method updates the template in a way that minimizes its drift from the target area.

3.3.2 Quantitative Analysis

For quantitative analysis, the difference between the ground truth of the target center and the target center found by each individual algorithm, i.e., the naive, α, and β methods, is calculated. It is named the center location error. Figure 3.4, Figure 3.5, and Figure 3.6

Figure 3.3 Comparison of different updating schemes (i.e., the naive, α, and β methods, shown in the first three rows, respectively) with the proposed method (fourth row) for the Faceocc video (Frames 1, 189, 302, 412, and 466). The proposed method successfully handles slowly occurring long-term occlusion.


show the center location error for the Girl, Woman, and Faceocc video sequences graphically. The horizontal axis represents the frame number and the vertical axis

Figure 3.4 Center distance error between the ground truth value and the value calculated by the naive, α, β, and proposed template updating methods for the Girl video. The template drift is much lower with the proposed method.

Figure 3.5 Center distance error between the ground truth value and the value calculated by the naive, α, β, and proposed template updating methods for the Woman video. The template drift is much lower with the proposed method.


shows the center distance between the calculated value and the ground truth of the target. It is clear from these figures that the proposed method has the lowest error (shown in black) compared to the other methods. Table 3.2 summarizes the graphical results by showing the mean center location error for each video sequence; it also shows that the proposed method, on average, has significantly smaller center location errors in comparison with the other methods.

3.4 Chapter Summary

This chapter presented a new template updating method for correlation based tracking algorithms. The proposed method updates the template according to the rate of

Figure 3.6 Center distance error between the ground truth value and the value calculated by the naive, α, β, and proposed template updating methods for the Faceocc video. The template drift is much lower with the proposed method.

Table 3.2 Mean center location error for the test video sequences using the naive, α, β, and proposed template updating methods.

Video | Naive method | α-method | β-method | Proposed method
Girl | 54.216 | 53.711 | 47.157 | 21.427
Woman | 109.590 | 129.219 | 60.589 | 2.353
Faceocc | 27.146 | 20.523 | 48.944 | 11.066


appearance changes of the target. In this way, the template incorporates the maximum changes of the target and the minimum background of the scene. Thus, it avoids the template drift problem and performs better in cases of occlusion and complex motion of the target. An edge-enhanced normalized correlation based tracking scheme has been employed. The proposed method may also be used with other similarity measures and tracking algorithms. Experimental results for different challenging publicly available videos show the efficacy of the proposed algorithm in comparison with three other template updating methods.


4 Proposed Visual Tracking Method

The correlation tracker is computationally intensive; its efficiency depends on the size of the search space and of the template. Moreover, it suffers from the template drift problem, and it may fall short in cases of a fast maneuvering target, rapid variations in the target's appearance, occlusion, and clutter in the background. In order to address these problems, a Kalman filter (KF) can be employed. The KF predicts the target coordinates in the next frame based on the measurement vector yielded by the correlation tracker. In this way, a relatively small search space can be defined around the position where the target is most likely to be found in the next frame. Thus, the tracker becomes more efficient and discards the clutter which lies outside the search space in the scene. However, if the tracker produces a wrong measurement vector due to clutter or occlusion inside the search space, the performance of the filter is considerably degraded. This chapter proposes a solution to this problem by incorporating the mean shift method into the tracking framework. The mean shift tracker is fast and has shown good results in the literature, but it fails when the histograms of the target and of a candidate region in the scene are similar (even when their appearances are different). In order to make the overall visual tracking framework robust to the aforementioned problems, the three approaches, i.e., correlation, KF, and mean shift, are combined heuristically, in such a way that they compensate for each other's weaknesses, for robust tracking results. The template updating method presented in Chapter 3 has been used in the proposed tracking framework. Furthermore, the framework uses novel methods for (1) an adaptive threshold for the similarity measure, which uses a variable threshold for each upcoming image frame based on the peak similarity value of the current frame with the template, and (2) an adaptive kernel size for the fast mean shift algorithm, based on the varying size of the target.
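The correlation-KF interplay described above can be illustrated with a minimal constant-velocity Kalman filter whose prediction centers a reduced search window. This is a sketch under stated assumptions: the class name, the noise settings q and r, and the window helper are illustrative, not the exact model used in this chapter.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter for the target center (x, y).
    The prediction tells the correlation tracker where to place its
    (reduced) search window; the tracker's match feeds back as measurement."""
    def __init__(self, x0, y0, q=1.0, r=4.0):
        self.x = np.array([x0, y0, 0.0, 0.0])            # state: x, y, vx, vy
        self.P = np.eye(4) * 10.0                        # state covariance
        self.F = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                           [0, 0, 1, 0], [0, 0, 0, 1]], float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
        self.Q = np.eye(4) * q                           # process noise
        self.R = np.eye(2) * r                           # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                                # predicted center

    def update(self, zx, zy):
        z = np.array([zx, zy], float)                    # correlation-tracker measurement
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P

def search_window(center, half_size):
    """Square search region (x0, y0, x1, y1) around the predicted center."""
    cx, cy = center
    return (cx - half_size, cy - half_size, cx + half_size, cy + half_size)
```

When the correlation measurement is wrong (clutter or occlusion inside the window), the filter is pulled off course, which is precisely the failure the mean shift component of the proposed framework is meant to guard against.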

4.1 Related Work

A brief summary of different tracking techniques can be found in Chapter 2. This section discusses, in somewhat more detail and with relevance to this chapter, the tracking algorithms related to correlation, the Kalman filter, and mean shift.


Different correlation based similarity measuring metrics, e.g., phase

correlation, normalized correlation, and normalized correlation coefficient, are used

for visual tracking. Phase correlation has been used by [199], [127], [200] for image

registration and tracking, but it is not robust to noise [130] and sometimes produces

higher peaks at wrong positions [131], [132], [44]. This problem was overcome in

[190] by using an edge-enhanced image instead of a grayscale image. Ahmed et al. [135]

used an extended flat-top Gaussian weighting function with the grayscale image to handle it.

Some other papers such as [201], [133], [134] also propose algorithms to enhance the

performance of phase correlation. None of these methods produces tracking results

as good as those of normalized correlation in the presence of variations in appearance,

shape, brightness, etc. [202], [190], [44]. Normalized Correlation Coefficient

(NCC) is another widely used similarity measure for object localization [136], [126],

[128], [203], [137]. NCC imposes the constraint of non-uniformity on the template and

the search window. The issue of occlusion handling using NCC was tackled in [193] with

the help of a Kalman filter. It checks the value of NCC against an empirically

determined threshold; if NCC is less than the threshold, occlusion is declared

and the next position of the target is calculated by the Kalman filter. A similar technique

for occlusion handling was used in [44], [191] with normalized correlation, which is

computationally more efficient than NCC in the spatial domain and does not restrict

the template or the search window to be non-uniform. It was shown in [202] that

normalized correlation produces better results than NCC when edge-enhanced image

is used for matching instead of grayscale images. Ali et al. [36] combined NCC with

the Kalman filter and fast mean shift to handle complex object motion; however, their

approach was not robust against clutter and occlusion.

Beleznai et al. used the fast mean shift algorithm for the detection of humans in

groups [29], [28]. They further extended their work to track humans [27], [26], [25].

Wang et al. [57] used multi-cue fusion based mean shift algorithm to track a human in

infrared imagery. Sutor et al. [204] presented efficient mean shift clustering to detect

and track humans. Shan et al. [52] proposed mean shift embedded particle filter for

hand tracking. Yilmaz et al. [205] used mean shift with motion compensation to track

targets in Forward Looking Infra-Red (FLIR) imagery. Comaniciu et al. [206]

employed color histogram for real-time visual object tracking of non-rigid objects

using mean shift. They used the Bhattacharyya coefficient as a similarity metric to find out


the candidate target, obtained by the mean shift algorithm, that is the most similar to

the target. Afterwards, Comaniciu and Ramesh [9] combined mean shift and KF for

object tracking based on color histogram. They used mean shift iterations to get the

best candidate target, while KF predicts the next target position in the upcoming image

frame. As the next frame arrives, mean shift is initialized at the target position

predicted from the previous frame. Li et al. [32] suggested adaptive KF with mean

shift for object tracking. It adaptively updates the parameters of KF as opposed to

previous techniques that keep KF parameters constant. Similar to [206] and [9], color

histogram based target representation is considered in [32]. Since the color histogram

does not carry spatial information about pixels [4], it is likely to detect a wrong object

with a histogram similar to that of the target [44]. Therefore, the idea of heuristically

combining correlation, Kalman filter and adaptive kernel fast mean shift algorithm for

better visual tracking results is proposed in this chapter.

4.2 Proposed Visual Object Tracking Framework

The proposed VOT algorithm combines the strengths of three basic trackers, i.e.,

correlation, KF, and mean shift. Besides this, the other contributions of the proposed

tracking method include: (1) a novel approach to template updating (described in

Chapter 3), (2) adaptive thresholding, and (3) an adaptive kernel fast mean shift

algorithm. The details of the proposed tracking framework and each of its components

are given as follows.

4.2.1 Correlation and KF based Tracking

In order to initiate the correlation based tracking process, the target is initially selected

by a user. The sub-image representing the target is called its template. In the proposed

method, an Edge-Enhanced (E2) representation of the template has been employed as the

target appearance. The search window is also made E2. The edge enhancement is a

four-step process consisting of Gaussian smoothing, calculation of the gradient

magnitude, normalization of intensity, and thresholding. Interested readers may study

[44] for further details of these steps. The size of the search window is not kept

constant; rather, it is dynamically adjusted with the help of KF throughout the tracking

session. Thus, the tracker becomes computationally efficient, avoids clutter outside

the search space, and achieves better tracking results. Details of the dynamic

search window can be studied from [191]. Variations in the size of the target in image


plane are handled by the following two processes. The first is correlating the original

template and its smaller and larger versions, i.e., scaled by 0.90 and 1.10, with the search space.

The size of the template which has the highest correlation value is considered for

matching in the next image frame. The same technique has been proposed for scale

handling in many other papers such as [17], [207], [191], [95]. The limitation of this

technique is that it works in discrete steps, i.e., a 10% scale change applied to the full

template; therefore, it does not work well if the template needs to change its size in a

particular direction. Hence, a second technique, the Best Match Rectangle

Adjustment (BMRA) algorithm [208], is used to resize the template according to the target

size and to keep the target at the center of the template. BMRA divides the template

into nine non-overlapping patches and calculates the energy of each patch. The

majority voting scheme is used for adjustment of the best match rectangle. This way,

it keeps the target at the center and tackles the problem of template drift, especially in

case of tracking airborne objects such as airplanes, flying kites, birds, helicopters,

etc. Details of the BMRA algorithm can be studied from [208]. After deciding the size

of the template, it is matched in the search window using normalized correlation

and the spatial location of the peak correlation value is considered as the current

position of the target in the search window. The matching is considered successful if

the peak value of the normalized correlation is greater than a threshold. Normalized

correlation is used as the similarity measure in the proposed tracking method because

it works relatively better for object localization on edge-enhanced images [202]. The

next step after matching is to update the template; this is explained in Chapter 3.

Algorithm 4.1 summarizes the correlation and KF based tracking methodology.

Algorithm 4.1 Correlation and Kalman filter tracking

Input: video sequence of n frames, template image of the target, t, and target bounding rectangle in the 1st frame, r
Output: target position in each frame of the video sequence

for each frame from 1 to n
  1. Make the template, t, edge enhanced
  2. Extract the search window, s
  3. Make the search window, s, edge enhanced
  4. Match t with s using normalized correlation (NC)
  5. cp ← max(NC)
  6. Update the size of t
  7. Handle occlusion using the Kalman filter
  8. Update t
  9. Output the bounding rectangle of t at position cp
end for
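For illustration, the edge enhancement and normalized correlation steps can be sketched as follows (a minimal NumPy sketch; the 3x3 box smoothing, the 0.2 threshold, and the brute-force spatial search are illustrative stand-ins for the exact implementation of [44], not the thesis code):

```python
import numpy as np

def edge_enhance(img, thresh=0.2):
    """E2 representation: smoothing, gradient magnitude,
    intensity normalization, and thresholding."""
    img = img.astype(float)
    # 3x3 box filter as a simple stand-in for Gaussian smoothing
    pad = np.pad(img, 1, mode="edge")
    sm = sum(pad[dy:dy + img.shape[0], dx:dx + img.shape[1]]
             for dy in range(3) for dx in range(3)) / 9.0
    gy, gx = np.gradient(sm)
    mag = np.hypot(gx, gy)
    mag /= mag.max() + 1e-12          # normalize intensities to [0, 1]
    mag[mag < thresh] = 0.0           # suppress weak edges
    return mag

def normalized_correlation(search, tmpl):
    """Brute-force normalized correlation; returns the peak value and
    the (x, y) of the template's top-left corner at the peak."""
    th, tw = tmpl.shape
    tn = tmpl / (np.linalg.norm(tmpl) + 1e-12)
    best, best_pos = -1.0, (0, 0)
    for y in range(search.shape[0] - th + 1):
        for x in range(search.shape[1] - tw + 1):
            patch = search[y:y + th, x:x + tw]
            val = float(np.sum(patch * tn) / (np.linalg.norm(patch) + 1e-12))
            if val > best:
                best, best_pos = val, (x, y)
    return best, best_pos
```

In the actual tracker the correlation is evaluated over the dynamically sized search window rather than the whole frame, in the spatial or Fourier domain, whichever is faster for the given sizes.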

4.2.2 Adaptive Threshold

A fixed threshold is used in many papers, e.g., [44], [191], [193], [36]; it sets a single

value for all frames of a video and does not take into account any local information

obtained from the correlation surface at each image frame. Therefore, it applies the

same criterion at every image frame regardless of the scene and target dynamics.

Thus, the method is highly likely to fail in the case of a fast maneuvering target

changing its appearance heavily and rapidly. The peak

correlation value provides clues about changes in the target; therefore, it may be used as

heuristic information to introduce adaptability into the threshold. For example, if the

current peak value of the normalized correlation is 0.85, this value may drop further

in the next image frame, so the threshold should be set well below the current

peak correlation value for upcoming frames. This way, the scheme uses local

information about the target matching score to set the threshold at each frame instead of

using a global value for all frames. To avoid a threshold so low that poor matches are

accepted as good ones, a lower limit is put on the adaptive threshold.

Mathematically, the process is described by Eq. (4.1).

\[
\tau =
\begin{cases}
c_p - \psi, & \text{if } \tau \ge \tau_l \\
\tau_l, & \text{otherwise}
\end{cases}
\tag{4.1}
\]

Algorithm 4.2 Adaptive threshold

Input: current threshold, τ; peak correlation value, cp
Output: updated threshold

Initialize τl and ψ
if τ ≥ τl
    τ ← cp − ψ
else
    τ ← τl
end if


where 0.10 ≤ ψ ≤ 0.17 and 0 < τl < 1, i.e., it is assumed that the target may change

its appearance by at most 17% in the next image frame. This limit was found

empirically, and it works well for both slow and fast maneuvering objects changing

their appearance slowly or rapidly. The pseudocode of the adaptive threshold method is

presented in Algorithm 4.2.
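Eq. (4.1) and Algorithm 4.2 translate directly into code (a sketch; the defaults ψ = 0.12 and τl = 0.65 are the values reported later in this chapter):

```python
def update_threshold(tau, c_p, psi=0.12, tau_l=0.65):
    """Adaptive threshold of Eq. (4.1): while the current threshold is at or
    above the lower limit, follow the latest peak correlation value c_p at a
    margin of psi below it; otherwise clamp to the lower limit tau_l."""
    if tau >= tau_l:
        return c_p - psi
    return tau_l
```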

4.3 Occlusion Handling with Kalman Filter

When the target is hidden, completely or partially, by another object in the scene,

occlusion is said to have occurred. Handling this situation is a vital task for every

visual object tracking algorithm. The peak correlation value may be used as an

occlusion indicator, because its value drops when the target suddenly gets occluded by

another object. When its value becomes less than the threshold, we stop updating the

template and assume that the target coordinates provided by the correlation tracker are

no longer trustworthy. The position previously predicted by the Kalman filter is

considered the current position of the target, and the Kalman filter is updated according

to its own prediction. The value of the threshold is iteratively reduced, because changes in the

target during occlusion are not incorporated in the template, so the peak correlation

value may drop below the threshold. Moreover, the size of the dynamically created search

window is made larger in each iteration in order to take into account the possible

variations in the direction and speed of the target during occlusion. The template is

correlated with the search window at each image frame, but the tracker remains in a

Kalman mode (i.e., the bounding box for the target is decided by Kalman predicted

coordinates) until the best-match score exceeds the threshold. Algorithm 4.3 sums up these

steps.

Algorithm 4.3 Occlusion handling with Kalman filter

1. Consider the previously predicted Kalman filter position as the current position of the target.
2. Update the Kalman filter according to its own prediction in the previous iteration.
3. Do not update the template during occlusion.
4. Iteratively reduce the value of the threshold.
5. Make the dynamically created search window larger at each iteration.
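The occlusion-handling steps above can be sketched around a constant-velocity Kalman filter (a minimal NumPy sketch; the state model, the noise covariances q and r, and the threshold-decay and window-growth factors are illustrative assumptions, not the tuned values of the thesis):

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal 2D constant-velocity Kalman filter (state: x, y, vx, vy)."""
    def __init__(self, x, y, dt=1.0, q=1e-2, r=1.0):
        self.s = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0
        self.F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                           [0, 0, 1, 0], [0, 0, 0, 1]], float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
        self.Q = np.eye(4) * q
        self.R = np.eye(2) * r

    def predict(self):
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2].copy()

    def correct(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.s = self.s + K @ (np.asarray(z, float) - self.H @ self.s)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.s[:2].copy()

def step(kf, peak, measured, tau, win_scale):
    """One tracking step of Algorithm 4.3: a below-threshold peak switches the
    tracker to Kalman mode (no template update, threshold reduced, window grown)."""
    predicted = kf.predict()
    if peak >= tau:                      # matching succeeded
        return kf.correct(measured), tau, win_scale
    kf.correct(predicted)                # update the KF with its own prediction
    return predicted, tau * 0.95, win_scale * 1.1
```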


4.4 Adaptive Fast Mean Shift Algorithm

Mean shift is used for segmentation and tracking due to its clustering and mode-seeking

capability. It is an iterative algorithm which starts by considering a random point as its

center, finds the mean value in its neighborhood, and shifts the center point to the newly

found mean position. The process ends when the change in position is extremely small

or the maximum number of iterations is reached. The mathematical details of mean

shift are simple and it is easy to apply to images [29]. In order to find the

weighted mean of data points, a kernel function is used to assign weights to each data

point. In the case of a uniform kernel, an integral image is calculated for fast computation

of the mean shift [29]. The difference of two consecutive frames usually shows moving

regions; these regions can be considered potential candidates for the target in a tracking scenario.

The mean shift technique can be used to find these regions in different images.

Beleznai et al. exploited the fast mean shift approach with a uniform kernel for human

detection and tracking [25-29]. The same technique is adopted in this thesis, but the

novelty is introduced by making the size of the kernel adaptive at each frame. The

size of the kernel is made equal to the size of the template. The template size is made

adaptive by the following two methods: (1) correlating the original template as well as

10% smaller and 10% larger templates with the search space. The size of the template,

which provided the highest peak correlation value, is considered as the new template

size [17], [207], [191], [95]. (2) Best Match Rectangle Adjustment (BMRA)

algorithm is used to resize the template according to the target size and keep the target

at the center of the template. It divides the template into nine non-overlapping

fragments and checks the energy content in each fragment. A voting scheme is used

for adjustment of the best match rectangle [208]. Moreover, we compute the difference

of search windows instead of full frames. The size of both search windows is kept the

same, and their difference is obtained by subtracting the previous search window from

the current one. In this way, too many moving regions and outliers in the difference

image are avoided. Furthermore, the process becomes more computationally efficient,

because the mean shift is now calculated in the search window only. Algorithm 4.4

summarizes these steps.

Algorithm 4.4 Adaptive fast mean shift algorithm

1. Calculate the difference of the search windows.
2. Calculate the size of the template in the current image by BMRA as well as by correlating 10% larger and 10% smaller templates with the search window.
3. Set the size of the kernel to the size of the template calculated in step 2.
4. Apply the fast mean shift algorithm to the difference image calculated in step 1 with the rectangular kernel calculated in step 3.
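The steps of Algorithm 4.4 can be sketched as follows (a NumPy sketch in the spirit of [29]; with a uniform rectangular kernel, each mean shift update costs O(1) via integral images of the difference image — the iteration cap and the convergence tolerance are illustrative):

```python
import numpy as np

def _integral(a):
    """Summed-area table with a zero border row/column."""
    p = np.zeros((a.shape[0] + 1, a.shape[1] + 1))
    p[1:, 1:] = np.cumsum(np.cumsum(a, axis=0), axis=1)
    return p

def fast_mean_shift(diff, start, kernel_size, max_iter=20, eps=0.5):
    """Mean shift with a uniform rectangular kernel on a difference image.
    diff: non-negative difference of consecutive search windows;
    start: initial (x, y) center; kernel_size: (height, width) = template size."""
    h, w = diff.shape
    kh, kw = kernel_size
    ys, xs = np.mgrid[0:h, 0:w]
    im, ix, iy = _integral(diff), _integral(diff * xs), _integral(diff * ys)

    def rect(ii, x0, y0, x1, y1):        # inclusive box sum, clipped to the image
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, w - 1), min(y1, h - 1)
        return ii[y1 + 1, x1 + 1] - ii[y0, x1 + 1] - ii[y1 + 1, x0] + ii[y0, x0]

    cx, cy = float(start[0]), float(start[1])
    for _ in range(max_iter):
        x0 = int(np.floor(cx - kw / 2 + 0.5)); x1 = int(np.floor(cx + kw / 2 + 0.5))
        y0 = int(np.floor(cy - kh / 2 + 0.5)); y1 = int(np.floor(cy + kh / 2 + 0.5))
        m = rect(im, x0, y0, x1, y1)
        if m <= 0:                        # kernel contains no motion evidence
            break
        nx, ny = rect(ix, x0, y0, x1, y1) / m, rect(iy, x0, y0, x1, y1) / m
        if np.hypot(nx - cx, ny - cy) < eps:
            cx, cy = nx, ny
            break
        cx, cy = nx, ny
    return cx, cy
```

Each iteration shifts the rectangular kernel to the centroid of the motion evidence it currently covers, so the cost per shift is independent of the kernel size.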

4.5 Combining Correlation, Kalman Filter and Adaptive Kernel Fast Mean Shift Algorithms

The Kalman filter is a measurement follower algorithm. It predicts the position of the

target in the next frame based on its position (determined by the correlation tracker) in the

current and the previous frames. It works in a prediction-correction cycle, i.e., it

predicts the next position of the target and corrects itself by exploiting the actual

position of the target. During steady state, its accuracy is determined by the closeness

of its predicted value with the measured value at each image frame. When the

difference between predicted and measured values gets larger than a threshold, it

indicates an alarming situation for the tracking scenario. It may be due to one of the

following reasons: (1) the correlation tracker provided a wrong measurement due to

clutter, blurriness, occlusion, out-of-plane rotation of the target, or some other issue in

the search window; or (2) the target has suddenly changed its direction (e.g., the target

may be moving back and forth briskly), in which case the correlation measurement is

the correct one. The problem becomes worse when there is no significant decrease in

the peak correlation value, i.e., no indication of occlusion. In order to tackle this issue

and to decide whether to follow the Kalman filter prediction or the correlation tracker

measurement, an algorithm is proposed in this chapter which combines the strengths

of correlation, the Kalman filter, and the adaptive kernel fast mean shift algorithm. For

this, the difference between the measured and the predicted target position in each

image frame is calculated. If the difference is greater than a threshold (the template

size), the difference of the current and the previous search windows is calculated, and

the adaptive fast mean shift algorithm is applied on the difference search window to

find the position of the potential candidate for the target. It is then checked whether the

measured or the predicted target position is closer to this value (i.e., the mean shift

calculated position). If it is the measured one, it is considered the correct position of

the target; otherwise, the predicted position is considered correct. Moreover, the

template is not updated in this case, and the area of the search window is increased

iteratively so that the possibility of the target going out of the search window is

avoided. Algorithm 4.5 presents these steps briefly and Figure 4.1 shows a flow chart

of the proposed tracking method.

Algorithm 4.5 Combining correlation, Kalman filter and adaptive fast mean shift algorithms

1. Calculate the difference between the measured and the predicted target position at each image frame.
2. If the difference is greater than a threshold, get the difference search window by subtracting the previous search window from the current one.
3. Apply the proposed adaptive fast mean shift algorithm in the difference search window and find the position of the potential candidate for the target, i.e., the candidate with the highest correlation value with the template.
4. Check whether the position calculated in step 3 is the nearest neighbor of the measured value or the predicted value. If it is the measured value, consider it the correct position; otherwise, give confidence to the predicted value.
5. The template is not updated.
6. The area of the search window is iteratively increased to avoid the possibility of the target going out of the search window.

Figure 4.1 Proposed tracking algorithm

Table 4.1 Description of dataset

Sequence              # of frames   Challenges involved
Faceocc2              812           Slowly occurring heavy occlusions, high appearance changes
ThreePastShop2Cor2    351           Similar objects, heavy occlusion, appearance and scale changes
Woman                 552           Occlusions, appearance changes
Car11                 393           Low light conditions
David                 462           Illumination changes, appearance changes
Singer                351           Illumination changes, scale changes
Board                 698           3D motion, cluttered background
Box                   1161          Fast 3D motion, occlusions, motion blur, cluttered background, scale changes
Liquor                1741          Fast 3D motion, occlusions, motion blur
Faceocc               887           Slowly occurring long-term occlusions, high appearance changes
Girl                  502           360° out-of-plane rotation, appearance change, occlusion
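The decision rule of Algorithm 4.5 reduces to a nearest-neighbor test between the mean shift candidate and the two position hypotheses (a sketch; `mean_shift_candidate` stands for the position returned by the adaptive fast mean shift on the difference search window):

```python
import math

def select_position(measured, predicted, template_size, mean_shift_candidate):
    """Choose between the correlation measurement and the KF prediction.
    The mean shift vote is consulted only when the two disagree by more than
    the template size (the threshold used in this chapter)."""
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    if dist(measured, predicted) <= template_size:
        return measured                  # measurement and prediction agree
    cand = mean_shift_candidate
    # trust whichever hypothesis the motion evidence is closer to
    return measured if dist(cand, measured) <= dist(cand, predicted) else predicted
```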

4.6 Results and Discussion

This section discusses: (1) the data set used to evaluate the proposed tracking algorithm,

(2) methods used for analysis of the tracking algorithm, (3) the effect of different

values of ψ for adaptive threshold on tracking results, (4) the comparison of the

proposed method with its base trackers, i.e., correlation tracker, and correlation and

KF tracker, and (5) the comparison of the proposed tracking strategy with nine state-

of-the-art tracking methods on different publicly available videos.

Table 4.2 Pascal score on test video sequences with different values of ψ

Sequence    ψ = 0.10   0.12    0.14    0.16    0.17
Faceocc2    1.00       1.00    1.00    1.00    1.00
Caviar      0.629      0.894   0.211   0.211   0.731
Woman       0.091      1.00    1.00    1.00    1.00
Car11       1.00       1.00    1.00    1.00    1.00
David       1.00       1.00    1.00    1.00    1.00
Singer      0.277      1.00    1.00    1.00    1.00
Board       0.312      0.75    0.77    0.78    0.78
Box         0.927      0.923   0.884   0.901   0.901
Liquor      0.831      0.954   0.736   0.711   0.562
Faceocc     0.983      0.933   0.865   0.865   0.865
Girl        0.663      0.891   0.743   0.743   0.743


4.6.1 Data Set

Eleven publicly available challenging videos have been used in different experiments

to show the robustness of the proposed algorithm. The videos are

Girl, Faceocc, Faceocc2, ThreePastShop2Cor2 (from Caviar dataset), Woman,

Car11, David, Singer, Board, Box, Liquor. Several researchers have used these videos

for benchmarking their algorithms in recent years [3, 4, 42, 53, 64, 65, 184, 209], so

the videos may be considered a de facto standard for tracking algorithm evaluation.

Girl, Faceocc, Faceocc2, and David videos can be downloaded from [197], Board,

Box, and Liquor videos are available at [210], and Woman, ThreePastShop2Cor2,

Singer, Car11 videos can be downloaded from [198, 211-213], respectively. Table 4.1

provides a description of these videos.

Table 4.3 Mean distance error on test video sequences with different values of ψ

Sequence    ψ = 0.10   0.12     0.14     0.16     0.17
Faceocc2    9.463      9.463    9.463    9.463    9.463
Caviar      43.131     4.963    66.539   66.315   24.805
Woman       111.211    2.353    2.353    2.353    2.353
Car11       1.559      1.559    1.559    1.559    1.559
David       6.079      6.079    6.079    6.079    6.079
Singer      88.129     2.630    2.630    2.630    2.630
Board       75.571     34.960   33.524   33.125   33.125
Box         10.703     12.122   13.818   13.130   13.130
Liquor      35.566     20.469   62.426   63.640   73.473
Faceocc     6.357      11.066   17.321   17.321   17.321
Girl        40.236     21.428   25.017   25.017   25.017


4.6.2 Analysis for Proposed Tracking Algorithm

The proposed algorithm is analyzed qualitatively as well as quantitatively. For

qualitative analysis, sample tracked frames of the proposed method are shown and

compared visually with the results of benchmark algorithms. The processed frames in

which the tracked rectangle is close to the target of interest are considered visually

better results.

Figure 4.2 Comparison of results for the simple correlation tracker, the correlation and KF tracker, and the adaptive fast mean shift embedded with the correlation and KF tracker on the ThreePastShop2Cor2 video (from the Caviar dataset). It supports the claim that adding the mean shift approach to the correlation and KF tracker (in the proposed way) improves the results.

Figure 4.3 Comparison of results for the simple correlation tracker, the correlation and KF tracker, and the adaptive fast mean shift embedded with the correlation and KF tracker on the Liquor video. It supports the claim that adding the mean shift approach to the correlation and KF tracker (in the proposed way) improves the results.

A quantitative analysis is performed to gain a better understanding of the robustness

of the proposed algorithm. For this, two measures have been used: one is the mean

distance from the center location, which gives the error between the center location of

the tracked rectangle and its ground truth value; the other is the Pascal VOC criterion

[214], which outputs the number of correctly tracked frames. The Pascal score can be

computed using Eq. (4.2):

\[
s = \frac{\mathrm{area}(R_t \cap R_g)}{\mathrm{area}(R_t \cup R_g)}
\tag{4.2}
\]

where Rt is the tracked rectangle and Rg is its ground truth. A frame is considered

correctly tracked if s > 0.5.
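Eq. (4.2) translates directly into code (a sketch; rectangles are assumed to be axis-aligned (x, y, width, height) tuples):

```python
def pascal_score(r_t, r_g):
    """Pascal VOC overlap (Eq. 4.2): intersection area over union area of the
    tracked rectangle r_t and the ground-truth rectangle r_g."""
    xa, ya = max(r_t[0], r_g[0]), max(r_t[1], r_g[1])
    xb = min(r_t[0] + r_t[2], r_g[0] + r_g[2])
    yb = min(r_t[1] + r_t[3], r_g[1] + r_g[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    union = r_t[2] * r_t[3] + r_g[2] * r_g[3] - inter
    return inter / union if union else 0.0
```

A frame counts as correctly tracked when this score exceeds 0.5.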

4.6.3 Adaptive Threshold with Different Parameter Values

The value of ψ plays a pivotal role in choosing the adaptive threshold. Various

experiments with different values of ψ in the range 0.10 to 0.17 have been performed

to calculate the Pascal score and the mean distance error on all test video sequences;

the value of τl is set to 0.65. The results are summarized in Table 4.2 and Table 4.3, respectively.

It can be concluded from these results that ψ = 0.12 provides better results for most of

the test video sequences.

Figure 4.4 Comparison of Pascal score of correlation KF tracker with and without adaptive fast mean shift algorithm


4.6.4 Comparison of Proposed Tracking Method with Its Constituents

In this section, three tracking algorithms are compared, i.e., (1) the simple correlation

tracker, (2) the correlation and KF based tracker, and (3) the proposed correlation, KF,

and adaptive fast mean shift based tracking algorithm. This way, the claim that

heuristically switching (with the help of mean shift approach) between correlation

based measured and KF based predicted target coordinates makes the tracking robust,

can be examined. The proposed adaptive threshold (discussed in Section 4.2.2) and

template updating methods (described in Chapter 3) have been used. Figure 4.2 and

Figure 4.3 show the center location error for ThreePastShop2Corr2 and Liquor

videos, respectively. It is clear from Figure 4.2 that the occlusion occurring during

frames 107 to 130 is not handled by the simple correlation tracker, which produces a

mean center error of 13.196. KF helps in this situation and reduces the average center

distance to 9.266, and the performance of the correlation-KF tracker improves

significantly when embedded with the adaptive fast mean shift approach, bringing the

average center distance down to 4.963. A similar situation can be seen in Figure 4.3

for the Liquor video, with mean center errors of 57.239, 55.492, and 20.469 for these

three approaches, respectively. The occluded region is marked in Figures 4.2 and 4.3

by a downward directed arrow labeled occlusion. In order to elaborate the advantage of

directed arrow with the label of occlusion. In order to elaborate the advantage of

integrating adaptive fast mean shift approach in correlation-KF tracker, mean distance

error and Pascal score is calculated on all the test videos with and without adaptive

Figure 4.5 Comparison of mean distance error of correlation KF tracker with and without adaptive fast mean shift algorithm

58

Page 81: Heuristic Approach for Robust VOTprr.hec.gov.pk/.../6742/...Sciences_2015_PIEAS_ISD.pdf · Telecom Endowment fund, PIEAS. It was his gratifying attitude that sets me free from my

Proposed Visual Tracking Method

fast mean shift algorithm as shown in Table 4.4. It is clear from the table that

integration of mean shift approach into correlation KF tracker significantly improves

the results. Figure 4.4 and Figure 4.5 summarize these results for Pascal score and

mean distance error, respectively.

4.6.5 Performance Comparison of Proposed Tracking Method with Other Methods

Tracking results of the proposed tracking method are compared with nine state-of-

the-art tracking methods, namely, incremental visual tracking (IVT) [14], l1 tracker

[19], PN learning [215], visual tracking decomposition (VTD) [42], MIL tracker [3],

FragTrack [4], local sparse appearance model (LSAM) [64], PROST [53], and EENC

tracker [44, 191]. The results of these trackers are quoted from the papers [53, 64];

therefore, if a result on a certain video is not reported there, it is not mentioned here

(except for the EENC tracker, which was run on all the videos).

Table 4.4 Comparison of correlation KF tracker with and without adaptive fast mean shift algorithm (MS = adaptive fast mean shift)

            Pascal VOC score            Mean distance error
Sequence    Without MS    With MS       Without MS    With MS
Faceocc2    0.939         1.00          14.014        9.463
Caviar      0.477         0.894         9.266         4.963
Woman       1.000         1.00          2.353         2.353
Car11       1.000         1.00          1.608         1.559
David       0.645         1.00          15.735        6.079
Singer      1.000         1.00          3.035         2.630
Board       0.064         0.75          206.746       34.960
Box         0.335         0.923         213.631       12.122
Liquor      0.756         0.954         55.492        20.469
Faceocc     0.908         0.933         16.961        11.066
Girl        0.713         0.891         23.282        21.427


Table 4.5 summarizes the results of mean center location error in pixels and

Table 4.6 shows the mean Pascal score. The first row of both tables shows the name of

the algorithm and its publication year. The best result for each video is shown in

bold-underline, the second best in italic-underline, and the third best in italic. The last

row of each table shows the average score of the algorithms over all the videos.

It is clear from the tables that the proposed algorithm, overall, performs better

than each of the other algorithms. The proposed tracking algorithm was implemented

using OpenCV on a Core i5 machine with 4 GB RAM. The number of frames

processed per second (fps) depends upon the size of the template and search window.

The normalized correlation is calculated in the Fourier domain or the spatial domain,

whichever is faster for the given sizes of the template and the search window. The

adaptive fast mean shift is also efficient compared to the original mean shift

algorithm due to the use of the integral histogram technique. Furthermore, it is

calculated only when there is no overlap between the predicted and the measured target coordinates, or when the peak correlation value is less than the threshold. On the average, the whole algorithm runs in real-time (i.e., 25 fps).

Table 4.5 Mean center location error for video sequences of the dataset

Video      IVT      L1       PN       VTD      MIL      FragTrack  LSAM     EENC      PROST    Proposed
           (2008)   (2009)   (2010)   (2010)   (2011)   (2006)     (2012)   (2008)    (2010)   method
Faceocc2   10.2     11.1     18.6     10.4     14.3     15.5       3.8      41.309    17.2     9.463
Caviar     66.2     65.9     53.0     60.9     83.9     94.2       2.3      91.867    ----     4.963
woman      167.5    131.6    9.0      136.6    122.4    113.6      2.8      104.549   ----     2.353
Car11      2.1      33.3     25.1     27.1     43.5     63.9       2.0      2.332     ----     1.559
David      3.6      7.6      9.7      13.6     15.6     46.0       3.6      17.418    15.3     6.079
Singer     8.5      4.6      32.7     4.1      15.2     22.0       4.8      15.589    ----     2.630
Board      165.5    177.0    97.0     96.1     51.2     90.1       7.3      165.347   37.0     34.960
Box        ----     196.0    ----     ----     104.6    57.4       ----     117.866   12.696   12.122
Liquor     ----     ----     ----     ----     115.1    30.7       ----     100.733   21.487   20.469
Faceocc    ----     ----     ----     ----     18.4     6.5        ----     48.641    7.0      11.066
Girl       48.5     62.5     23.2     21.5     31.5     26.5       ----     53.711    19.0     21.427
Average    59.013   76.622   33.537   46.287   55.973   51.491     3.8      69.033    18.526   11.554
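The switch between the spatial and the Fourier domain for the correlation can be sketched as below. This is a minimal illustration under our own assumptions: the operation-count heuristic and the function name `cross_correlate` are ours, and the normalization terms of true normalized correlation are omitted for brevity.

```python
import numpy as np

def cross_correlate(search, template):
    """Cross-correlate a template with a search window, choosing the
    cheaper domain.  Direct (spatial) cost grows with both sizes; the
    FFT cost grows roughly as N log N of the search window.  The exact
    switching rule here is an illustrative assumption, not the thesis' rule."""
    sh, sw = search.shape
    th, tw = template.shape
    spatial_cost = sh * sw * th * tw
    fft_cost = 3 * sh * sw * np.log2(sh * sw + 1) * 10  # rough constant factor
    if spatial_cost <= fft_cost:
        # direct spatial-domain "valid" correlation
        out = np.zeros((sh - th + 1, sw - tw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(search[i:i + th, j:j + tw] * template)
        return out
    # frequency-domain correlation via the convolution theorem:
    # IFFT(FFT(search) * conj(FFT(zero-padded template)))
    F = np.fft.rfft2(search)
    G = np.fft.rfft2(template, s=search.shape)
    full = np.fft.irfft2(F * np.conj(G), s=search.shape)
    return full[:sh - th + 1, :sw - tw + 1]
```

Both branches return the same valid-region correlation surface; for true normalized correlation one would additionally divide by the local window energy.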

Figure 4.6, Figure 4.7, Figure 4.8, Figure 4.9, Figure 4.11, and Figure 4.10 graphically depict the center distance error and the Pascal score for the Box, Board, and Liquor videos for every fifth frame (ground truth is available only for these frames). The graphs show the results of the proposed method, EENC, MIL, PROST, and FragTrack in blue, red, green, black, and magenta, respectively. It is evident from these figures that the proposed method outperforms all the other methods.

Figure 4.12 shows the performance of the proposed tracking method for the Box video during occlusions (e.g., Frames 297 and 486), scale changes, and complex target motion, including 3D rotation creating motion blur in a cluttered background (e.g., Frames 555 and 928).

Table 4.6 Pascal VOC score for video sequences of the dataset

Video      IVT      L1       VTD      PN       MIL      FragTrack  LSAM     EENC     PROST    Proposed
           (2008)   (2009)   (2010)   (2010)   (2011)   (2006)     (2012)   (2008)   (2010)   method
Faceocc2   0.59     0.84     0.59     0.49     0.96     0.60       0.82     0.515    0.82     1.00
Caviar     0.21     0.20     0.19     0.21     0.19     0.19       0.84     0.309    ----     0.894
woman      0.19     0.18     0.15     0.60     0.16     0.20       0.78     0.182    ----     1.00
Car11      0.81     0.44     0.43     0.38     0.17     0.09       0.81     0.886    ----     1.00
David      0.72     0.63     0.53     0.60     0.70     0.47       0.79     0.742    0.80     1.00
Singer     0.66     0.70     0.79     0.41     0.33     0.34       0.74     0.246    ----     1.00
Board      0.17     0.15     0.36     0.31     0.679    0.679      0.74     0.136    0.75     0.75
Box        ----     0.05     ----     ----     0.245    0.614      ----     0.506    0.914    0.923
Liquor     ----     ----     ----     ----     0.206    0.799      ----     0.504    0.854    0.954
Faceocc    ----     ----     ----     ----     0.93     1.00       ----     0.449    1.00     0.933
Girl       0.42     0.32     0.51     0.57     0.70     0.70       ----     0.287    0.89     0.891
Average    0.471    0.390    0.444    0.446    0.479    0.516      0.789    0.433    0.861    0.940

Figure 4.14 shows a few tracked frames of the Liquor video sequence. The proposed algorithm successfully tracks the target during occlusions (as shown in Frames 360, 607, 776, 1115, 1183, 1236, 1319, 1355, 1438, and 1462) and 360° rotation causing motion blur (e.g., Frames 1404 and 1407).

Figure 4.13 shows that the proposed algorithm successfully handles the out-of-plane rotation of the target in the cluttered background of the Board video.

Figure 4.15 (Car11 video frames) shows that the proposed algorithm tracks the target in low light conditions.

Figure 4.16 shows a few frames of the David video. The proposed algorithm handles varying illumination conditions (e.g., Frames 1 and 25), complex target motion (e.g., Frame 160), and target appearance changes (e.g., Frame 383).

Figure 4.17 depicts the results of the proposed algorithm on Faceocc2 video.

The video contains large appearance changes (for example, Frame 19 and Frame 577)

and slowly occurring heavy occlusions (e.g., more than 90% of the face is occluded, as shown in Frame 720). The proposed template updating and tracking strategy keeps the target locked successfully.

Figure 4.6 Center distance error for Box video sequence

Figure 4.19 shows some frames of the Singer video. The proposed algorithm successfully handles the high illumination effects on the target (e.g., Frame 115) and the large change in its scale (e.g., Frame 333).

Figure 4.18 shows some of the frames from the ThreePastShop2Cor2 video of the Caviar dataset. The video contains similar objects, which makes it difficult to track the target. The situation worsens when other objects occlude the target (e.g., Frames 83 and 120). The proposed method shows prominent results and

successfully tracks the target.

4.7 Chapter Summary

Correlation-based methods have been in use since the very start of the visual tracking field [56], [103], [138], [216], and they have shown their strength in long-term tracking sessions [65], [191]. Classically, however, this approach has a few inherent issues: (1) it is computationally intensive, (2) it suffers from the template drift problem, and (3) it may fail in the case of a fast maneuvering target, rapid changes in target appearance, or occlusion and clutter in the scene.

Figure 4.7 Pascal Score for Box video sequence

These issues are handled, to some extent, by

integrating the KF with correlation-based tracking and temporarily updating the template [202], [193]. Taking the position of the peak correlation value as the position of the target in the current image frame, the KF predicts the target position in the upcoming image frame. Thus, a relatively small search window can be determined where the

occurrence of the target is highly likely [44]. Moreover, the KF carries the tracker through occlusions of the target. Occlusion is assumed to be occurring if the correlation value of the target in the search window falls below a threshold; therefore, choosing the right threshold value is very important. Many papers [190], [36], [191], [44], [193] use a fixed threshold, but a fixed value does not work as the scene complexity changes. A new method for an adaptive threshold, based on the peak correlation value of the current frame, is proposed in this chapter. During occlusion, the correlation-based measurement vector is ignored and the KF-predicted vector is used as the next measurement vector. In this way, (1) the tracker becomes fast, (2) its performance remains safe from clutter outside the search window, and (3) it is robust to occlusion as well. However,

due to all or any of the above-mentioned issues occurring inside the search window, the tracker may provide wrong measurements to the KF, which in turn generates wrong predictions, deteriorating the whole tracking process.

Figure 4.8 Distance Score for Board video sequence

Now the question arises: how can it be detected automatically that this situation has happened? To answer this

question, the difference between the KF-predicted and the correlation-based measured coordinates is calculated and checked against another adaptive threshold based on the target size. The next question that intuitively comes to mind is whether the tracker should go with the predicted or the measured coordinates. The adaptive fast mean shift algorithm is used to answer this question. It is applied to find the clusters in the difference of the search windows of two consecutive frames. These clusters are moving regions in the video and thus become potential candidates for being the target. The nearest-neighbor technique is used to check whether a candidate target is closer to the predicted or the measured coordinates. In this way, the fast mean shift algorithm acts as an arbitrator between the KF and the correlation-based results, and the KF is protected from being misled by a wrong measurement vector. The size of the kernel for mean shift is set adaptively according to the changing target size. To tackle the issue of rapid change in target appearance, a novel method is proposed that updates the target model according to the rate of appearance change of the target. In general, the proposed tracking strategy can be considered as an ensemble of the three techniques, complementing each other in complex situations.
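The arbitration heuristic summarized here can be sketched as follows. This is a simplified sketch under our own assumptions: the disagreement threshold `beta * target_size`, the function name `arbitrate`, and the tie-breaking rule are illustrative, and the mean-shift clustering of the frame difference is assumed to have been done elsewhere.

```python
import math

def arbitrate(predicted, measured, candidates, target_size, beta=0.5):
    """Heuristically choose between the KF prediction and the correlation
    measurement.

    predicted, measured: (x, y) from the Kalman filter and the correlation peak.
    candidates: (x, y) centroids of moving clusters found by fast mean shift
                on the difference of two consecutive search windows.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # If prediction and measurement agree (adaptive, size-based threshold),
    # trust the measurement.
    if dist(predicted, measured) <= beta * target_size:
        return measured
    # Otherwise let the moving clusters arbitrate: the candidate nearest to
    # either hypothesis decides which one the tracker should follow.
    if not candidates:
        return predicted  # no motion evidence: stay with the KF prediction
    nearest = min(candidates,
                  key=lambda c: min(dist(c, predicted), dist(c, measured)))
    if dist(nearest, measured) < dist(nearest, predicted):
        return measured
    return predicted
```

When the two hypotheses agree, no mean shift is run at all, which is what keeps the method fast in the common case.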

The switching from one technique to another is decided heuristically as described above.

Figure 4.9 Pascal Score for Board video sequence

Figure 4.11 Distance Score for Liquor video sequence

Figure 4.10 Pascal Score for Liquor video sequence


Frame 297  Frame 486  Frame 555  Frame 928

Figure 4.12 Sample tracked frames of the Box video sequence. The proposed algorithm successfully tracks the target during occlusions, scale changes, 3D motion causing blurriness, and background clutter.

Frame 360  Frame 607  Frame 776  Frame 1115
Frame 1183  Frame 1236  Frame 1319  Frame 1355
Frame 1438  Frame 1462  Frame 1504  Frame 1517

Figure 4.14 A few tracked frames of the Liquor video sequence. The proposed approach successfully tracks during occlusions, 3D motion causing blurriness, and background clutter.

Frame 20  Frame 156  Frame 545  Frame 599

Figure 4.13 Results for the Board video sequence. The proposed algorithm successfully handles the out-of-plane motion of the target in a cluttered background.


Frame 19  Frame 130  Frame 172  Frame 267
Frame 421  Frame 492  Frame 577  Frame 720

Figure 4.17 A few frames of the Faceocc2 video sequence. The proposed algorithm tracks the target with large appearance changes and slowly occurring heavy occlusions.

Frame 1  Frame 25  Frame 200  Frame 305

Figure 4.15 Frames of the Car11 video sequence. The proposed algorithm successfully tracks the target in low light conditions.

Frame 1  Frame 25  Frame 160  Frame 383

Figure 4.16 Some frames from the David video sequence. The proposed algorithm tracks the target in varying illumination and appearance changes.


Frame 1  Frame 115  Frame 136
Frame 240  Frame 265  Frame 333

Figure 4.19 A few frames of the Singer video sequence. The proposed algorithm successfully handles high illumination effects as well as large scale changes.

Frame 1  Frame 83  Frame 120  Frame 317

Figure 4.18 Some tracked frames from the sequence ThreePastShop2Cor2 (Caviar dataset). The main challenges in the video include the existence of similar objects and the occlusions that occur as the persons in the sequence cross each other. The proposed method successfully tracks the target.


5 Stabilized Active Camera Tracking System

An active camera tracking system (ACTS) tracks a target with a moving video camera. The system is illustrated as part of the block diagram shown in Figure 5.1.

It consists of: (1) a video camera, (2) a visual tracking algorithm, (3) a pan-tilt control

algorithm, and (4) a pan-tilt unit (PTU). Every frame acquired from the video camera

is analyzed by the visual tracking algorithm, which localizes the target in the image in

pixel-coordinates. The coordinates are sent to the pan-tilt control algorithm which

rotates the PTU according to the motion of the object. Since the camera is attached to

the PTU, it also rotates in sync with the PTU. Thus, the tracked target is always

projected at the center of the video frames, regardless of whether the object is moving

or stationary.
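The per-frame operation of the ACTS described above can be sketched as a simple closed loop; the three callables here stand in for the actual camera, tracker, and pan-tilt control modules, and their names are our own.

```python
# Per-frame control loop of an active camera tracking system (ACTS),
# paraphrasing the pipeline described above.  The callables are
# placeholders, not real APIs from the thesis implementation.

def acts_loop(capture_frame, track, send_pan_tilt, n_frames):
    """capture_frame() -> frame, track(frame) -> (x, y) pixel coordinates,
    send_pan_tilt(x, y) rotates the PTU (and the attached camera) so the
    target stays projected near the frame center."""
    positions = []
    for _ in range(n_frames):
        frame = capture_frame()      # 1) grab a frame from the video camera
        x, y = track(frame)          # 2) localize the target in the image
        send_pan_tilt(x, y)          # 3) command the PTU; camera rotates in sync
        positions.append((x, y))     # 4) log the tracked position
    return positions
```

In the real system the loop runs at the video frame rate, so the PTU command latency must stay below one frame period.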

If the ACTS is mounted on a vibrating platform such as a truck, helicopter, or ship, the video must be stabilized without affecting the efficiency of the system. The purpose of video stabilization is to filter out the annoying vibration from the video and thereby reduce unnecessary strain on the eyes of the viewer.

A simplified block diagram of the stabilized ACTS is shown in Figure 5.1. The VOT algorithm has been explained in Chapter 4, so an introduction to the remaining algorithmic components of the system is provided below.


Figure 5.1 Simplified block diagram of the proposed stabilized active camera tracking system.


5.1 Pan-Tilt Control

The camera in the stabilized ACTS is mounted on top of a PTU, so it moves in sync

with it. The PTU motion is controlled by a control algorithm. If the control is not

smooth and precise, the object in the video will oscillate to-and-fro from the center of

the frame, and in the worst case the object may get out of the field of view (FOV).

One approach is to use a classic proportional-integral-derivative (PID) controller

[217]. However, its design requires a mathematical model of the system. Besides, it necessitates sensitive and rigorous tuning of its three gain parameters (i.e., proportional, integral, and derivative) at all the zoom levels of the camera. An

alternative approach is to use a fuzzy controller [218, 219] that does not require the

system model, but choosing a set of right membership functions and fuzzy rules

calibrated for every zoom-level of the camera is practically very cumbersome.

Another alternative is to implement a neural network controller [220], but it is heavily

dependent on the quality and the variety of the examples in the training data set,

which can accurately represent the complete behavior of the controller in all possible

scenarios, including the varying zoom-levels of the camera. Moreover, the traditional

control algorithms, e.g. [221], are generally implemented based on the difference

between the center of the frame and the current target position in the frame. These

algorithms do not account for the target velocity. As a result, there will be oscillations (if the object is moving slowly), lag (if it is moving at a moderate speed), or loss of the object from the frame (if it is moving faster than the maximum pan-tilt velocity generated by the control algorithm). Keeping in view the above-mentioned limitations

of the various control algorithms, a predictive open-loop car-following control (POL-CFC) algorithm [44] is proposed for target tracking. Its basic idea is borrowed

from the car-following control (CFC) strategy [222]. The CFC assumes that the actual

velocity of the PTU is observable through a velocity sensor. However, the POL-CFC

does not make this assumption and simply considers that the current PTU velocity is

the previous velocity command sent to the PTU. Then, it computes the velocity of the

target relative to the PTU velocity from the predicted target positions provided by the

Kalman filter in the current and the next frame. Finally, it generates precise velocity

commands for the PTU to move the camera towards the target accurately in real-time.

Thus, the proposed control strategy is very useful for controlling a system, which does

not feedback its current velocity, such as stepper-motor PTU. Its performance is tested


on real-world scenarios and has proven to be adequately smooth, fast and accurate.

The POL-CFC algorithm in the proposed stabilized ACTS offers 0% overshoot, 0

steady-state tracking error, and 1.7 second rise-time at least for 1x to 6x zoom levels

of the camera.

5.2 Video Stabilization

Video stabilization is the process of removing vibrations from the video. It has a very wide application spectrum, ranging from consumer devices (e.g., handy-cams, mobile

phone with video camera, etc.) to state-of-the-art military and defense systems, e.g.

the payloads for unmanned aerial vehicle (UAV) and unmanned ground vehicle

(UGV) [223], etc. There are many hardware as well as software solutions available

for video stabilization with their own merits and demerits depending upon their

applications. There are two types of motion when the camera is mounted on a PTU.

One is valid motion that comes due to the motion of the object to be tracked. The

other is the annoying motion caused by the mechanical vibration transmitted from the vibratory vehicle (on which the PTU is mounted) or by environmental factors (such as wind). The aim of video stabilization is to filter out the latter motion [224,

225].

The ideal approach for video stabilization is the hardware solution. Use of

mechanical tools to physically avoid camera vibration is one of the hardware

solutions. Another solution is to exploit optical or electronic devices to influence how the camera sensor receives the input light [226, 227]. These solutions are expensive or need additional information about the camera motion. Therefore, image

processing based video stabilization (also called digital video stabilization) is the

approach of common choice [226].

An optical flow based method is adopted by Chand, Lie and Lu [228] for digital video stabilization. However, it has an inherent aperture problem [229]. Fuzzy logic modeling is used in [226] for video stabilization, but it is time consuming to select the membership functions and tune their parameters to achieve satisfactory results. An image based rendering technique is used in [230]. However, it works well only in the case of slow camera motion. Block matching methods are used for stabilization [231-233],

but these algorithms do not track blocks in consecutive frames, so they may be misled

by large moving objects [226]. In order to handle these problems, a stabilization


algorithm which estimates the vibratory motion between the frames by taking inputs

from the visual tracking module (discussed in Chapter 4) is proposed. The proposed

stabilization method does not add any extra computational overhead to the system for

estimating the instantaneous vibration. The vibratory motion in the video is filtered using a simple low-pass filter. Thus, the stabilization algorithm works at the full frame rate of a standard video (i.e., 25 fps).

5.3 Proposed Pan-Tilt Control Algorithm

The proposed pan-tilt control algorithm has been derived from the basic car-following control (CFC) law [222]. The CFC law can be used only for a closed-loop system in which the current velocity of the pan-tilt unit (PTU) is fed back to the control algorithm. The CFC is modified here so that it can be used in an open-loop system. The modified control algorithm is named the Predictive Open-Loop Car-Following Control (POL-CFC) strategy. The algorithm generates the pan-tilt velocity commands in accordance with the Kalman-predicted [44] target velocity components in the video frames. The use of the predicted velocity helps compensate for the inertia of the pan-tilt mechanism and hence follow the target without any lag or inaccuracy. The POL-CFC strategy is described mathematically as:

v_p[n+1] = v_p[n] + η ( K e*_x[n+1|n] − v*_rp[n+1|n] )
v_t[n+1] = v_t[n] + η ( v*_rt[n+1|n] − K e*_y[n+1|n] )          (5.1)

where v_p[n] and v_t[n] are the current pan and tilt velocities of the PTU (which were generated by Eq. (5.1) in the previous iteration), η is a small positive constant in the range (0.0, 1.0] which controls the amount of velocity added to the previous velocity, K is the proportional gain parameter (the only parameter to be tuned for every zoom level of the camera), and e*_x[n+1|n] and e*_y[n+1|n] are the predicted errors in both axes, defined as:

e*_x[n+1|n] = r_x − x*[n+1|n]
e*_y[n+1|n] = r_y − y*[n+1|n]          (5.2)

where (r_x, r_y) is the reference position, i.e., the center of the video frame.


Furthermore, v*_rp[n+1|n] and v*_rt[n+1|n] in Eq. (5.1) are the predicted relative velocities of the target in terms of pan-tilt degrees per second, defined as:

v*_rp[n+1|n] = C_dpp ( x*[n+1|n] − x*[n|n−1] ) / T
v*_rt[n+1|n] = C_dpp ( y*[n+1|n] − y*[n|n−1] ) / T          (5.3)

where C_dpp is a conversion ratio in degrees per pixel, determined by a simple camera calibration procedure for all the zoom levels of the camera, T is the sampling time (the inverse of the video frame rate), and (x*[n+1|n], y*[n+1|n]) and (x*[n|n−1], y*[n|n−1]) are the target coordinates in the video frame predicted by the Kalman filter in the current and the previous iterations, respectively. Through POL-CFC, we have achieved 0% overshoot, 1.47 second rise time, and maximum steady-state error as illustrated in Table 5.1.
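One iteration of the POL-CFC law, Eqs. (5.1)-(5.3), can be sketched as below. The function name and argument layout are ours; the signs follow the thesis' pan/tilt sign conventions (pan positive towards the left, tilt positive downwards) as we have reconstructed them, so treat them as an assumption.

```python
def pol_cfc_step(vp, vt, x_pred_next, y_pred_next, x_pred_curr, y_pred_curr,
                 rx, ry, K, eta, c_dpp, T):
    """One POL-CFC iteration (Eqs. 5.1-5.3), a reconstruction sketch.

    vp, vt       : previous pan/tilt velocity commands (deg/s), taken as the
                   current PTU velocity since the loop is open
    *_pred_next  : Kalman-predicted target position x*[n+1|n], y*[n+1|n]
    *_pred_curr  : previous prediction x*[n|n-1], y*[n|n-1]
    rx, ry       : reference position (frame center, pixels)
    """
    # Eq. (5.2): predicted position errors in pixels
    ex = rx - x_pred_next
    ey = ry - y_pred_next
    # Eq. (5.3): predicted relative target velocity in deg/s
    v_rp = c_dpp * (x_pred_next - x_pred_curr) / T
    v_rt = c_dpp * (y_pred_next - y_pred_curr) / T
    # Eq. (5.1): new open-loop velocity commands
    vp_next = vp + eta * (K * ex - v_rp)
    vt_next = vt + eta * (v_rt - K * ey)
    return vp_next, vt_next
```

For a centered, stationary target both error and relative-velocity terms vanish, so the commands remain unchanged, which is the zero steady-state-error behavior reported in Table 5.1.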

5.4 Proposed Video Stabilization Algorithm

The proposed video stabilization algorithm takes two inputs: the current video frame

and the current image coordinates (x, y) of the target (estimated by the tracker

described in Chapter 4). The algorithm outputs the stabilized video frame, which can

be seen on a monitor, as illustrated in the block diagram shown in Figure 5.1.

Software approach for video stabilization (which is also called digital video

stabilization) requires a foreground object with respect to which the stabilization

process is performed. In the proposed algorithm, the target is taken as a foreground

object. The inter-frame motion is estimated with the help of its image coordinates (x, y).

Table 5.1 Maximum steady-state error of the tracker

Zoom          Maximum steady-state error (pixels)
1x to 6x      0
7x to 15x     ±1
16x to 19x    ±2
20x to 25x    ±3

This motion contains low frequency components (i.e., valid motion) as well as

high frequency components (i.e., vibration). In order to filter out the latter, a low pass

filter is used in x and y axes. The proposed two-dimensional filter is given as:

x̂_n = α x_n + (1 − α) x̂_(n−1)
ŷ_n = α y_n + (1 − α) ŷ_(n−1)          (5.4)

where x_n and y_n are the coordinates of the target in the un-stabilized current frame estimated by the tracking module, x̂_n and ŷ_n are the stabilized coordinates of the target in the current frame, x̂_(n−1) and ŷ_(n−1) are the stabilized coordinates in the previous frame, and α is the filter coefficient, with a value in the range 0 < α < 1 to meet the filter stability criterion. The lower the value of α, the lower the cutoff frequency of the low-pass filter (as illustrated in Figure 5.2). The

cut-off frequency of a low-pass filter is defined as the frequency above which the

magnitude of the frequency response of the filter is ideally zero. However, practical

low-pass filter response does not become zero immediately beyond the cutoff frequency, so the cutoff frequency is normally considered as the frequency where the magnitude of the frequency response of the filter is 1/√2 (i.e., 0.707).

Figure 5.2 Relationship between α and cut-off frequency of the low-pass filter

Thus, the value

of α can be set according to the vibration frequency involved in the application at

hand. For example, the frequency response of the filter when α is set to 0.11 is shown in Figure 5.3. This figure also shows that the cutoff frequency of the filter is 0.5

Hz (at the magnitude of 0.707). It may be observed that the frequency response

around the cutoff frequency (i.e., the roll-off) is not perfectly steep because the filter is real (not ideal) and of first order (i.e., single pole). The filter has a single parameter α, which is easy to tune at run-time while observing its effects on the stabilized video.

However, if one is sure about the optimal cutoff frequency for a specific application, a higher order filter can be designed using the MATLAB filter design tool and implemented to have as small a transition region as possible around the cutoff frequency in its frequency response.
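The first-order filter of Eq. (5.4), and the relationship between α and the cutoff frequency, can be sketched as below. The function names are ours, and seeding the filter with the first sample is an assumption; the cutoff is found numerically from |H|² = 1/2 for H(z) = α / (1 − (1 − α) z⁻¹).

```python
import math

def smooth(coords, alpha):
    """First-order low-pass filter of Eq. (5.4):
    xhat[n] = alpha * x[n] + (1 - alpha) * xhat[n-1],
    applied to a 1-D coordinate track.  Seeding with the first sample
    (instead of zero) avoids a start-up transient; this is our choice."""
    out, prev = [], coords[0]
    for c in coords:
        prev = alpha * c + (1 - alpha) * prev
        out.append(prev)
    return out

def cutoff_hz(alpha, fps=25.0):
    """-3 dB cutoff of the filter above: solve |H(e^{jw})|^2 = 1/2
    with |H|^2 = alpha^2 / (1 - 2b cos(w) + b^2), b = 1 - alpha."""
    b = 1.0 - alpha
    cos_w = (1.0 + b * b - 2.0 * alpha * alpha) / (2.0 * b)
    return math.acos(cos_w) * fps / (2.0 * math.pi)
```

At 25 fps, α = 0.11 gives a cutoff of roughly 0.46-0.5 Hz, consistent with the 0.5 Hz figure quoted for Figure 5.3.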

Once the stabilized xy-coordinates of the target in the current frame are

obtained, the vibratory motion estimation vector is calculated as below:

M_xn = x_n − x̂_n
M_yn = y_n − ŷ_n          (5.5)

Figure 5.3 Magnitude of frequency response of the low-pass filter at α = 0.11


where M_xn and M_yn are the x- and y-components of the vibratory motion estimation vector for the current frame. The estimated vibratory motion is then compensated by

translating every frame-pixel at (i, j) in the opposite direction of the vibratory motion,

such that the new coordinates of the pixel become (i’, j’), calculated as:

i' = i − M_xn
j' = j − M_yn          (5.6)

It may be noted that i and j are the horizontal and vertical coordinates of the pixel, respectively. If the new position of a pixel falls outside the frame boundaries, that pixel is discarded. As a result, we get the stabilized

video frame with respect to the target, at the cost of some undefined or vacant regions at the boundaries from where the pixels were translated.

Frame 34  Frame 42

Figure 5.4 Original (left side) versus stabilized (right side) frames of a video recorded from a vibratory flying helicopter

There are three widely used

strategies to fill these regions. One approach is to fill the vacant regions with black pixels, but this creates an unpleasant impact on the viewer because the number of such pixels changes continuously with the varying instantaneous magnitude of the vibration-induced oscillations. Another approach is to fill the vacant regions with the same regions from the previous frame, but this creates unpleasant artifacts at the boundaries of the video frame. Yet,

another approach is to define a border of fixed size, greater than the anticipated maximum vibration amplitude, and resize the stabilized image within the border up to the full frame. The stabilized video then displays only a slightly zoomed-in version of the valid scene with no black border. However, this approach slightly deteriorates the sharpness of the image due to the bilinear interpolation involved in resizing. Moreover, the interpolation puts an extra computational overhead on the system. In order to overcome the above-mentioned limitations of the three approaches, a black boundary of a fixed width greater than the anticipated maximum vibration amplitude, without resizing the stabilized portion of the image, is proposed in this thesis.
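The compensation of Eqs. (5.5)-(5.6), together with the proposed fixed black border, can be sketched as follows. Rounding the motion components to whole pixels, the function name `compensate`, and the border width parameter are our assumptions.

```python
import numpy as np

def compensate(frame, mx, my, border):
    """Shift the frame opposite to the estimated vibratory motion
    (Eqs. 5.5-5.6): pixel (i, j) moves to (i - mx, j - my).  Pixels
    shifted outside the frame are discarded, and a fixed-width black
    border (border >= 1) hides the vacant regions without any resizing."""
    h, w = frame.shape[:2]
    out = np.zeros_like(frame)
    dx, dy = int(round(mx)), int(round(my))
    # source rectangle whose pixels remain inside the frame after the shift
    x0, x1 = max(0, dx), min(w, w + dx)
    y0, y1 = max(0, dy), min(h, h + dy)
    out[y0 - dy:y1 - dy, x0 - dx:x1 - dx] = frame[y0:y1, x0:x1]
    out[:border], out[-border:] = 0, 0     # fixed-width black border
    out[:, :border], out[:, -border:] = 0, 0
    return out
```

Because only an integer copy and a border fill are involved, this keeps the stabilization at the full frame rate, with no interpolation cost.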

Figure 5.5 Original versus stabilized x-coordinates of the left truck shown in Figure 5.4


5.5 Results and Discussion

This section presents the results of: (a) the video stabilization algorithm on some challenging off-line videos, (b) the active camera tracking system (ACTS), and finally (c) the stabilized ACTS.

5.5.1 Performance of Stabilization Algorithm

Figure 5.4 shows vibratory aerial video frames (left side) versus stabilized video

frames (right side) using the proposed algorithm. The un-stabilized video was recorded from a flying helicopter while tracking a truck using the proposed tracking algorithm, and it contains jitter due to the helicopter vibration, creating an unpleasant

effect on the eyes of the viewer. The large yellow crosshair in the video frames is

overlaid to easily perceive the vibration in the un-stabilized frames and its effective

attenuation in the stabilized frames. In order to visualize the effects of stabilization on

the whole image sequence, a record of the x-y coordinates of the truck (on the left side of the road) in the original and the stabilized video frames is maintained, because the vibratory motion of the truck can be considered as the vibratory motion of the whole scene.

Figure 5.6 Original versus stabilized y-coordinates of the left truck shown in Figure 5.4

The coordinates are

shown in Figure 5.5 and Figure 5.6, where it may be noted that the coordinates of the

truck in the stabilized frames are smoother than those in the original vibratory video

frames.

Frame 20  Frame 24  Frame 53

Figure 5.7 Original (left side) versus stabilized (right side) frames of a video recorded from a vibratory hovering helicopter

Another example showing the efficacy of the stabilization algorithm is given in Figure 5.7, which presents the frames of the video of a building taken from a

hovering helicopter. Vibrations of the helicopter yield the vibratory image frames shown in the left column of Figure 5.7; the right column shows the frames after stabilization. The plots in Figure 5.8 and Figure 5.9 further illustrate the stabilization process.

5.5.2 Performance of Active Camera Tracking System

In this section, some experimental results of the active camera tracking system are presented to show its robust and accurate performance.

Figure 5.10 shows some frames of a tracking session in which a helicopter is

being tracked. The best match area is represented by a white rectangle, and the frame

center (i.e., the optical axis of the camera) is represented by a white dot. The updated

edge-enhanced template is overlaid at the bottom-right of every frame. The overlaid

text at the top of the frames consists of the correlation peak value, the center

coordinates of the tracked target in the frame, zoom level of the camera, and finally

the pan-tilt velocities of the camera mounted on a PTU in degrees/second. The pan

velocity is positive if the camera is rotating towards the left, and the tilt velocity is positive if the camera is rotating downwards. It can be seen that the helicopter is automatically

Figure 5.8 Original versus stabilized x-coordinates of the building shown in Figure 5.7


centralized very efficiently and smoothly in the video by increasing the pan velocity

within the first 40 frames (i.e. 1.2 seconds), which is less than even the rise time of

the proposed pan-tilt control system as mentioned in Section 5.1. After the initial

automatic target centralization, the helicopter remains at the center of the frames

throughout the tracking session, which can be verified from the target coordinates shown in the overlaid text, keeping in view that the frame size is 320×240. It may be noted

that the helicopter is being tracked persistently and precisely with the proposed active

camera tracking system even when: (1) the user had initialized the template

inaccurately due to the motion of the helicopter in the video, and (2) the size, the

appearance, and the velocity of the helicopter are continuously varying. The BMR

adjustment algorithm solves the incorrect initialization problem by resizing/relocating

the BMR so that it tightly encloses the target very efficiently within the first 20

frames. Later on, the BMR is further dynamically resized/relocated according

to the current size of the helicopter by both the scale handling method and the BMR

adjustment algorithm.

Figure 5.11 illustrates how efficiently the proposed system tracks the face of a

person walking in a room with all the lights turned off. The only light available in the room was coming through the blinds shown in the frames. This

Figure 5.9 Original versus stabilized y-coordinates of the building shown in Figure 5.7


natural light created a severe illumination variation in the video, since the camera was

operating in its auto mode. Specifically, when the camera was looking in the direction

of the bright window, the other things (persons, wall, etc.) became very dark (see

Frames 271 to 512), and when there was no bright window in the video frames, the

Figure 5.10 A helicopter is being tracked persistently and precisely with the proposed tracking system even when the user has initialized the template inaccurately, and the size, the appearance, and the velocity of the helicopter are continuously varying (Frames 1, 20, 40, 300, 385, and 520).


whole scene became a little clearer. It may also be noted that the whole video is noisy and lacks detail due to the low-light conditions. The target person and the

occluding person are both walking in the same direction, making the scenario even

more complex. It can be further observed in Frame 495 that the occlusion of the

tracked person by the other person happens partly in the bright region and partly in

the dark region of the video frame. Moreover, the track of the target person after the

occlusion is resumed in the very dark region, as shown in Frame 512. Since the

persons were very near to the camera, even a small movement of the persons was

reflected as a large movement in the video frames. Thus, it was a challenging

Figure 5.11 Tracking the face of a person during severe illumination variation, noise, low detail, and occlusion (Frames 176, 271, 325, 481, 495, 512, 528, 540, and 569). All the lights in the room were turned off in this experiment to create a challenging scenario. The dark yellow rectangle in Frame 495 indicates that the tracker is currently working in its occlusion handling mode.

experiment for the pan-tilt control algorithm as well. All the problems (i.e. severe

illumination variation, noise, low detail, full occlusion, and fast motion) are handled

very efficiently and robustly by the proposed active camera tracking system in real-time, and the face of the person of interest is always at (or near) the center of the video frames.

Figure 5.12 Results of un-stabilized (left column) vs. stabilized active camera tracking (right column) of a distant airplane (Frames 420, 422, 425, 427, and 430)

5.5.3 Performance of Stabilized Active Camera Tracking System

In this section, the results of the complete stabilized active camera tracking system are

demonstrated.

Figure 5.12 shows some of the frames of a long active camera tracking and

stabilization session of a very distant airplane at 25x (highest) zoom level of the

camera. The un-stabilized as well as the stabilized video frames are recorded in real-time for demonstration purposes. The left column depicts the resulting tracking video

frames without stabilization, while the right column shows the resulting tracking

video frames with stabilization. A periodic vibratory force at the rate of 1 Hz was

used to induce vibration on the PTU and thus on the real-time video. In the un-

stabilized video frames, visual tracking and control algorithms always try to keep the

target at the center of the image plane, but due to vibration the airplane is oscillating

about the frame center. The video frames in the right column show that the stabilization

module of the system successfully diminishes the vibration in the target being tracked.

In Figure 5.13, a pedestrian is being tracked in a cluttered environment at 11x

zoom level. The vibratory source in this case is a Toyota 2400 cc Hilux engine

working at 500 RPM. The synchronized tracking video frames without and with

stabilization are again shown in the left and the right columns, respectively. Images on the left side highlight the oscillatory motion of the scene, because the active camera

tracking system is mounted on a vibratory vehicle. Images in the right column show

that the tracked person and the varying background scene are stable and the vibration

is significantly attenuated.

5.6 Chapter Summary

A robust stabilized active camera tracking system is proposed, consisting of a visual

tracking module, a pan-tilt control module, and a video stabilization module. The

visual tracking module can handle template-drift, noise, object fading (obscuration),

clutter, intermittent occlusion, varying illumination in the scene, high computational

complexity, and varying shapes, scale, and velocity of the maneuvering target during


its motion. The proposed pan-tilt control module is a predictive open-loop car-

following-control algorithm, which moves the camera efficiently and smoothly so that

the object being tracked is always at the center of the video frame. The control

algorithm offers 0% overshoot, negligible steady-state error, and 1.47 second rise-

time. The video stabilization module handles the annoying vibratory motion in the

Figure 5.13 Results of un-stabilized (left column) vs. stabilized active camera tracking (right column) of a pedestrian (Frames 196, 209, 226, 249, and 261)


image frame during tracking while the system is mounted on a vibratory platform

(e.g., vehicle, helicopter, etc.). The complete proposed system has been successfully

used for more than two years in indoor as well as outdoor scenarios, and it works in

real-time at the full frame rate of 25 fps.
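As a rough illustration of the pan-tilt control module's role, a minimal proportional velocity mapping is sketched below. This is a simplification, not the thesis's predictive open-loop car-following controller; the gain, field of view, and saturation limit are illustrative assumptions, and only the sign convention (pan positive towards the left, tilt positive downwards) is taken from the text.

```python
def pan_tilt_velocity(target_xy, frame_size=(320, 240), fov_deg=(48.0, 37.0),
                      gain=2.0, v_max=60.0):
    """Sketch: map the target's offset from the frame center to pan/tilt
    velocity commands in degrees/second. All parameters are illustrative."""
    cx, cy = frame_size[0] / 2.0, frame_size[1] / 2.0
    # Offset of the target from the optical axis, in pixels.
    dx, dy = target_xy[0] - cx, target_xy[1] - cy
    # Convert the pixel offset to an angular error using the field of view.
    err_pan = dx * fov_deg[0] / frame_size[0]
    err_tilt = dy * fov_deg[1] / frame_size[1]
    # Sign convention from the text: pan velocity is positive when the camera
    # rotates towards the left, tilt positive when it rotates downwards.
    pan = max(-v_max, min(v_max, -gain * err_pan))
    tilt = max(-v_max, min(v_max, gain * err_tilt))
    return pan, tilt
```

A target left of center thus yields a positive pan command (rotate left), and a centered target yields zero velocity, which is consistent with the centralization behavior reported above.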


6 Conclusion and Future Work

Visual tracking is a non-trivial task in an unstructured environment and for a class-

independent target. This thesis presents a new visual tracking framework which

combines correlation, a Kalman filter, and the mean shift algorithm. The proposed tracking

method successfully tracks the target (whose type is not known in advance) in an

unknown environment. This chapter summarizes the thesis and provides future

directions for VOT.

6.1 Summary

Chapter 1 provides an introduction to visual object tracking, and shows its usability in

other fields, e.g., human-computer interaction, security and surveillance systems,

activity recognition, industrial robotics, etc. Moreover, the chapter explains different

issues such as template drift, changing target appearance, occlusion, clutter, similar

objects, etc., which make VOT a non-trivial task.

Chapter 2 provides various classical and contemporary approaches for VOT; thus, old as well as recent techniques for visual tracking are discussed. Moreover, the chapter provides a list of different online resources, which includes datasets and code for different tracking algorithms.

Chapter 3 explains the proposed template updating method. The updating

method determines the rate of change in the target's appearance, and sets the update rate

of the template accordingly. This way, the template is efficiently updated for slow as

well as fast moving targets without drifting. In both qualitative and quantitative comparisons, the proposed updating method outperforms three other methods, i.e., the naïve, α, and β updating methods.
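The idea of driving the update rate from the rate of appearance change can be sketched as follows. This is a minimal illustration, not the thesis's exact formulation: the function name, the normalized-difference measure, and the rate bounds are all assumed for the example.

```python
import numpy as np

def update_template(template, best_match, rate_min=0.05, rate_max=0.5):
    """Sketch of an adaptive template update: a faster-changing target
    appearance yields a larger update rate, within [rate_min, rate_max].
    Names and the rate mapping are illustrative assumptions."""
    template = template.astype(np.float64)
    best_match = best_match.astype(np.float64)
    # Measure appearance change between the current template and the newly
    # matched region (normalized mean absolute difference, in [0, 1]).
    change = np.mean(np.abs(best_match - template)) / 255.0
    # Map the change to an update rate, clipped to safe bounds.
    alpha = np.clip(change * 2.0, rate_min, rate_max)
    # Blend the old template with the new observation.
    return (1.0 - alpha) * template + alpha * best_match
```

With a slowly changing target the blend stays close to the old template (limiting drift), while a rapidly changing target is absorbed more quickly.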

Chapter 4 proposes a tracking framework consisting of: (1) correlation tracker,

(2) Kalman filter, and (3) mean shift tracker. These three methods work jointly to

reinforce each other’s strength and to suppress the individual weaknesses. The

correlation tracker normally suffers from the template drift problem, so the adaptive

template updating method proposed in Chapter 3 is used. In order to handle occlusion,


the KF is combined with the correlation tracker. The correlation threshold used to sense the occurrence of occlusion is set adaptively at each image frame based on its

peak value in the previous frame. Moreover, a search area is defined based on the KF-

predicted position in the next frame to reduce the computation for correlation

matching. The size of the search area is dynamically set according to the speed and direction

of target’s motion. KF predicted and correlation measured target positions do not

coincide if the correlation-KF method fails to track the target and at the same time

peak correlation value does not drop below the threshold due to the presence of any

similar object in the background. In this case, adaptive fast mean shift algorithm is

proposed, which finds the position of moving region, i.e., candidate target, in the

search window. Its difference is calculated with both KF predicted and correlation

measured coordinates, and whichever is finds less is considered as the target position.
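The dynamic search-area sizing mentioned above can be sketched as below. Only the speed term is illustrated here (the thesis also accounts for the direction of motion), and `base` and `k` are illustrative parameters, not the thesis's tuned values.

```python
import math

def search_window(pred_xy, velocity_xy, base=32, k=1.5, frame=(320, 240)):
    """Sketch: a search area centered on the KF-predicted position whose
    side grows with the target's speed, clipped to the frame bounds."""
    speed = math.hypot(*velocity_xy)       # target speed in pixels/frame
    half = base / 2.0 + k * speed          # faster target -> larger window
    x0 = max(0.0, pred_xy[0] - half)
    y0 = max(0.0, pred_xy[1] - half)
    x1 = min(float(frame[0]), pred_xy[0] + half)
    y1 = min(float(frame[1]), pred_xy[1] + half)
    return x0, y0, x1, y1
```

Restricting correlation matching to this window is what keeps the per-frame computation low for slow targets while still covering the larger displacement of fast ones.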

The algorithm is compared with nine other recent methods on eleven challenging videos which pose different challenges such as occlusion, clutter, change of size, fast motion, out-of-plane rotation, etc. The experimental results, which include sampled tracked frames for qualitative evaluation, and center location error and Pascal score for quantitative evaluation, show that the proposed method tracks the target more robustly than the other methods.
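The fusion decision described earlier in this summary (correlation peak thresholding for occlusion, agreement between the KF prediction and the correlation measurement, and the mean-shift tie-breaker) might be sketched as follows. The names, the coincidence tolerance, and the Euclidean distance metric are illustrative assumptions rather than the thesis's exact parameters.

```python
import numpy as np

def fuse_position(kf_pred, corr_pos, corr_peak, threshold, mean_shift_pos,
                  coincide_tol=5.0):
    """Sketch of the correlation/KF/mean-shift fusion logic."""
    kf_pred = np.asarray(kf_pred, dtype=float)
    corr_pos = np.asarray(corr_pos, dtype=float)
    if corr_peak < threshold:
        # Occlusion sensed: fall back on the Kalman-filter prediction.
        return tuple(kf_pred)
    if np.linalg.norm(kf_pred - corr_pos) <= coincide_tol:
        # Prediction and measurement agree: accept the correlation match.
        return tuple(corr_pos)
    # Disagreement despite a high peak (likely a similar distractor): pick
    # whichever position lies nearer to the mean-shift candidate target.
    ms = np.asarray(mean_shift_pos, dtype=float)
    d_kf = np.linalg.norm(ms - kf_pred)
    d_corr = np.linalg.norm(ms - corr_pos)
    return tuple(kf_pred) if d_kf < d_corr else tuple(corr_pos)
```

In this sketch the mean-shift output is consulted only when the two primary trackers disagree, which matches the "reinforce each other's strengths" role described for the three components.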

Chapter 5 provides the details of the stabilized active camera tracking system. The active camera tracking system comprises a camera mounted on a pan-tilt unit. In order to smoothly track the target using the proposed tracking method, the PTU is moved using the car-following control algorithm. When the ACTS is mounted on a vibratory platform, the output video contains jitters and vibration. Stabilization methods normally require a reference object according to which the whole frame is stabilized.

The proposed stabilization method uses the target position, calculated by the tracking

method, and smooths out the vibrations using a single-pole low-pass filter, without any significant computation overhead. Experimental results show the efficacy of the algorithm.
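A single-pole low-pass filter of the kind described above can be sketched in a few lines. The filter coefficient `alpha` (the pole location) is an illustrative value, not the thesis's tuned one; the offset between the raw and the smoothed position would then be used to shift each frame.

```python
def stabilize(positions, alpha=0.9):
    """Sketch: one-pole IIR smoothing of the tracked target position,
    y[n] = alpha * y[n-1] + (1 - alpha) * x[n]."""
    smoothed = []
    y = None
    for p in positions:        # p = (x, y) target position in each frame
        if y is None:
            y = p              # initialize the filter state with the first sample
        else:
            y = tuple(alpha * yi + (1.0 - alpha) * pi for yi, pi in zip(y, p))
        smoothed.append(y)
    return smoothed
```

Because only one state vector is kept per tracked coordinate, the smoothing adds essentially no computational overhead per frame, consistent with the claim above.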

6.2 Future Work

Although the proposed tracking framework shows robustness against different issues, including occlusion, clutter, changing target appearance, and heavy motion, there is still considerable room for future work in the tracking method, described as

follows:

If the target changes its speed or direction significantly during occlusion, it is likely that the KF would not be able to predict the target position correctly.

The assumption that the temporally updated template should not change its appearance by more than 50 percent compared with the initially selected template might not hold when the target moves continuously away from the camera.

The presence of other moving objects similar to the target in the search area reduces the robustness of the algorithm if occlusion is not sensed.

A higher-order stabilization filter may remove more oscillations from the vibratory

video at the cost of tuning more parameters.


References

[1] S. Stalder and H. Grabner. (2009, 07 October). on-line boosting trackers. Available: http://www.vision.ee.ethz.ch/boostingTrackers/onlineBoosting.htm

[2] H. Grabner, M. Grabner, and H. Bischof, "Real-Time Tracking via On-line Boosting," in British Machine Vision Conference (BMVC), 2006, pp. 47-56.

[3] B. Babenko, M. H. Yang, and S. Belongie, "Robust object tracking with online multiple instance learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 1619-1632, 2011.

[4] A. Adam, E. Rivlin, and I. Shimshoni, "Robust fragments-based tracking using the integral Histogram," in IEEE conference on computer vision and pattern recognition (ICPR), 2006, pp. 798-805.

[5] D. P. Chau, F. Bremond, and M. Thonnat, "Object Tracking in Videos: Approaches and Issues," The International Workshop'Rencontres UNS-UD'(RUNSUD), 2013.

[6] R. T. Collins, Y. Liu, and M. Leordeanu, "Online selection of discriminative tracking features," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 1631-1643 2005.

[7] X. Zhang, W. Hu, S. Maybank, X. Li, and M. Zhu, "Sequential particle swarm optimization for visual tracking," in IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008., 2008, pp. 1-8

[8] M. Yang, Y. Wu, and G. Hua, "Context-aware visual tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 1195-1209, 2009.

[9] D. Comaniciu and V. Ramesh, "Mean Shift and Optimal Prediction for Efficient Object Tracking," in IEEE International Conference on Image Processing (ICIP), 2000, pp. 70–73.

[10] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," in IEEE Conference on Computer Vision and Pattern Recognition . , 2000, pp. 142-149.

[11] S. Hare, A. Saffari, and P. H. S. Torr, "Struck: Structured output tracking with kernels," in IEEE International Conference on Computer Vision (ICCV), 2011, 2011, pp. 263-270.

[12] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," Acm Computing Surveys (CSUR), vol. 38, 2006.

92

Page 115: Heuristic Approach for Robust VOTprr.hec.gov.pk/.../6742/...Sciences_2015_PIEAS_ISD.pdf · Telecom Endowment fund, PIEAS. It was his gratifying attitude that sets me free from my

[13] Y. Li and R. Nevatia, "Key Object Driven Multi-category Object Recognition, Localization and Tracking Using Spatio-temporal Context," in Europian Conference on Computer Vision, , 2008, pp. 409-422.

[14] D. A. Ross, J. Lim, R. S. Lin, and M. H. Yang, "Incremental learning for robust visual tracking," International Journal of Computer Vision, vol. 77, pp. 125-141, 2008.

[15] D.-S. Jang and H.-I. Choi, "Active models for tracking moving objects," Pattern Recognition, vol. 33, pp. 1135-1146, 2000.

[16] X. Zhang, W. Hu, W. Qu, and S. Maybank, "Multiple object tracking via species-based particle swarm optimization," IEEE Transactions on Circuits and Systems for Video Technology vol. 20, pp. 1590-1602, 2010.

[17] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, 2003.

[18] S. Avidan, "Support vector tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 1064-1072, 2004.

[19] X. Mei and H. Ling, "Robust visual tracking using ℓ 1 minimization," in IEEE 12th International Conference on Computer Vision, 2009, pp. 1436-1443.

[20] K. A. Joshi and D. G. Thakore, "A Survey on Moving Object Detection and Tracking in Video Surveillance System," International Journal of Soft Computing and Engineering (IJSCE) ISSN, pp. 2231-2307, 2012.

[21] Z. Li, C. Xu, and Y. Li, "Robust object tracking using mean shift and fast motion estimation," in IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS) , 2007, pp. 734-737

[22] H. T. Nguyen, Q. Ji, and A. W. M. Smeulders, "Spatio-temporal context for robust multitarget tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence vol. 29, pp. 52-64, 2007.

[23] R. Akbari, M. D. Jazi, and M. Palhang, "A hybrid method for robust multiple objects tracking in cluttered background," in Information and Communication Technologies, 2006. ICTTA'06. 2nd, 2006, pp. 1562-1567.

[24] C. Yang, R. Duraiswami, and L. Davis, "Efficient mean-shift tracking via a new similarity measure," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 176-183.

[25] C. Beleznai, B. Frühstück, and H. Bischop, "Human Tracking by Fast Mean Shift Mode Seeking," Trans. Journal of Multimedia (JMM), vol. 1, pp. 1-8, April 2006.

[26] C. Beleznai, B. Frühstück, and H. Bischop, "Human Tracking by Mode Seeking," in Proc. 4th International Symposium on Image and Signal Processing and Analysis (ISPA), 2005, pp. 1-6.

93

Page 116: Heuristic Approach for Robust VOTprr.hec.gov.pk/.../6742/...Sciences_2015_PIEAS_ISD.pdf · Telecom Endowment fund, PIEAS. It was his gratifying attitude that sets me free from my

[27] C. Beleznai, B. Frühstück, and H. Bischop, "Tracking Multiple Humans using Fast Mean Shift Mode Seeking," in IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005, pp. 25-32.

[28] C. Beleznai, B. Frühstück, and H. Bischop, "Detecting Humans in Groups using a Fast Mean Shift Procedure," in Proc. 28th Workshop of the Austrian Association for Pattern Recogniton (AAPR), 2004, pp. 71-78.

[29] C. Beleznai, B. Frühstück, and H. Bischop, "Human Detection in Groups using a Fast Mean Shift Procedure," in International Conference on Image Processing (ICIP), 2004, pp. 349-352.

[30] H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song, "Recent advances and trends in visual tracking: A review," Neurocomputing, vol. 74, pp. 3823-3831, 2011.

[31] B. Kwolek, "Multi-object Tracking Using Particle Swarm Optimization on Target Interactions," in Advances in Heuristic Signal Processing and Applications, ed: Springer, 2013, pp. 63-78

[32] X. Li, T. Zhang, X. Shen, and J. Sun, "Object Tracking using an Adaptive Kalman Filter combined with Mean Shift," Optical Engineering (OE) Letters, vol. 49(2), February 2010.

[33] L. Wen, Z. Cai, Z. Lei, D. Yi, and S. Li, "Robust Online Learned Spatio-Temporal Context Model for Visual Tracking," IEEE Transactions on Image Processing, 2013.

[34] Z. Zivkovic and B. Krose, "An EM-like algorithm for color-histogram-based object tracking," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2004, pp. 798-803.

[35] K. Cannons, "A review of visual tracking," Dept. Comput. Sci. Eng., York Univ., Toronto, Canada, Tech. Rep. CSE-2008-07, 2008.

[36] A. Ali and S. M. Mirza, "Object tracking using correlation, Kalman filter and fast means shift algorithms," in International Conference on Emerging Technologies, 2006. ICET'06. , Islamabad, 2006, pp. 174-178.

[37] H. Grabner, J. Matas, L. Van Gool, and P. Cattin, "Tracking the invisible: Learning where the object might be," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1285-1292.

[38] L. Anton-Canalis, M. Hernandez-Tejera, and E. Sanchez-Nielsen, "Particle swarms as video sequence inhabitants for object tracking in computer vision," in Sixth International Conference on Intelligent Systems Design and Applications, 2006. ISDA'06. , 2006, pp. 604-609.

[39] H. Zhou, Y. Yuan, Y. Zhang, and C. Shi, "Non-rigid object tracking in complex scenes," Pattern Recognition Letters, vol. 30, pp. 98-102, 2009.

94

Page 117: Heuristic Approach for Robust VOTprr.hec.gov.pk/.../6742/...Sciences_2015_PIEAS_ISD.pdf · Telecom Endowment fund, PIEAS. It was his gratifying attitude that sets me free from my

[40] H. Grabner, C. Leistner, and H. Bischof, "Semi-supervised on-line boosting for robust tracking," in Computer Vision–ECCV, ed: Springer, 2008, pp. 234-247.

[41] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf, "Survey of pedestrian detection for advanced driver assistance systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 1239-1258, 2010.

[42] J. Kwon and K. M. Lee, "Visual tracking decomposition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1269-1276.

[43] Y. Zheng and Y. Meng, "Adaptive object tracking using particle swarm optimization," in International Symposium on Computational Intelligence in Robotics and Automation, 2007. CIRA 2007., 2007, pp. 43-48.

[44] J. Ahmed, M. N. Jafri, M. Shah, and M. Akbar, "Real-Time Edge-Enhanced Dynamic Correlation and Predictive Open-Loop Car Following Control for Robust Tracking," Machine Vision and Applications Journal, vol. 19, pp. 1-25, January 2008.

[45] K. Zhang and H. Song, "Real-time visual tracking via online weighted multiple instance learning," Pattern Recognition, 2012.

[46] N. Jifeng, L. Zhang, D. Zhang, and C. Wu, "Robust object tracking using joint color-texture histogram," International Journal of Pattern Recognition and Artificial Intelligence, vol. 23, pp. 1245-1263, 2009.

[47] N. A. Ogale, "A survey of techniques for human detection from video," Survey, University of Maryland, 2006.

[48] A. M. Abdel Tawab, M. B. Abdelhalim, and S.-D. Habib, "Efficient multi-feature PSO for fast gray level object-tracking," Applied Soft Computing, 2013.

[49] C. Ridder, O. Munkelt, and H. Kirchner, "Adaptive background estimation and foreground detection using kalman-filtering," in Proceedings of International Conference on recent Advances in Mechatronics, 1995, pp. 193-199.

[50] B. Zeisl, C. Leistner, A. Saffari, and H. Bischof, "On-line semi-supervised multiple-instance boosting," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, 2010.

[51] E. Trucco and K. Plakas, "Video tracking: a concise survey," Oceanic Engineering, IEEE Journal of, vol. 31, pp. 520-529 2006.

[52] C. Shan, T. Tan, and Y. Wei, "Real-time Hand Tracking using a Mean Shift Embedded Particle Filter," Trans. Pattern Recognition, vol. 40, pp. 1958-1970, 2007.

95

Page 118: Heuristic Approach for Robust VOTprr.hec.gov.pk/.../6742/...Sciences_2015_PIEAS_ISD.pdf · Telecom Endowment fund, PIEAS. It was his gratifying attitude that sets me free from my

[53] J. Santner, C. Leistner, A. Saffari, T. Pock, and H. Bischof, "PROST: Parallel robust online simple tracking," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 723-730.

[54] S. K. Borra and S. K. Chaparala, "Tracking of an Object in Video Stream Using a Hybrid PSO-FCM and Pattern Matching," International Journal of Engineering, vol. 2, 2013.

[55] J. K. Aggarwal and Q. Cai, "Human motion analysis: A review," in Nonrigid and Articulated Motion Workshop, 1997. Proceedings., IEEE, 1997, pp. 90-102

[56] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in 7th international joint conference on Artificial intelligence, 1981.

[57] X. Wang, L. Liu, and Z. Tang, "Infrared Human Tracking with Improved Mean Shift Algorithm based on Multi-cue Fusion," Trans. Journal of Applied Otics, vol. 48, pp. 4201-4212, July 2009.

[58] B. Zhan, D. N. Monekosso, P. Remagnino, S. A. Velastin, and L.-Q. Xu, "Crowd analysis: a survey," Machine Vision and Applications, vol. 19, pp. 345-357 2008.

[59] M. Isard and A. Blake, "Condensation—conditional density propagation for visual tracking," International Journal of Computer Vision, vol. 29, pp. 5-28, 1998.

[60] W. Kang and F. Deng, "Research on intelligent visual surveillance for public security," in Computer and Information Science, 2007. ICIS 2007. 6th IEEE/ACIS International Conference on, 2007, pp. 824-829.

[61] J. Jeyakar, R. V. Babu, and K. R. Ramakrishnan, "Robust object tracking with background-weighted local kernels," Computer Vision and Image Understanding, vol. 112, pp. 296-309, 2008.

[62] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time compressive tracking," in Computer Vision–ECCV 2012, ed: Springer, 2012, pp. 864-877.

[63] O. Arikan and L. Ikemoto, Computational Studies of Human Motion: Tracking and Motion Synthesis: Now Publishers Inc, 2006.

[64] X. Jia, H. Lu, and M. H. Yang, "Visual tracking via adaptive structural local sparse appearance model," in IEEE Conference on Computer Vision and Pattern Recognition (2012, pp. 1822-1829.

[65] M. I. Khan, J. Ahmed, A. Ali, and A. Masood, "Robust Edge-Enhanced Fragment Based Normalized Correlation Tracking in Cluttered and Occluded Imagery," Signal Processing, Image Processing and Pattern Recognition, pp. 169-176, 2009.

96

Page 119: Heuristic Approach for Robust VOTprr.hec.gov.pk/.../6742/...Sciences_2015_PIEAS_ISD.pdf · Telecom Endowment fund, PIEAS. It was his gratifying attitude that sets me free from my

[66] I. S. Kim, H. S. Choi, K. M. Yi, J. Y. Choi, and S. G. Kong, "Intelligent visual surveillance—A survey," International Journal of Control, Automation and Systems, vol. 8, pp. 926-939, 2010.

[67] W. Zhong, H. Lu, and M.-H. Yang, "Robust object tracking via sparsity-based collaborative model," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, 2012, pp. 1838-1845.

[68] T. B. Moeslund, A. Hilton, and V. Krüger, "A survey of advances in vision-based human motion capture and analysis," Computer Vision and Image Understanding, vol. 104, pp. 90-126 2006.

[69] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, pp. 174-188, 2002.

[70] A. S. Jalal and V. Singh, "The State-of-the-Art in Visual Object Tracking," Informatica, vol. 36, pp. 227-248, 2012.

[71] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, and A. v. d. Hengel, "A Survey of Appearance Models in Visual Object Tracking," ACM Transactions on Itelligent Systems and Technology, 2013.

[72] D.-N. Ta, W.-C. Chen, N. Gelfand, and K. Pulli, "SURFTrac: Efficient tracking and continuous object recognition using local feature descriptors," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2009, pp. 2937-2944

[73] I. Skrypnyk and D. G. Lowe, "Scene modelling, recognition and tracking with invariant image features," in Third IEEE and ACM International Symposium on Mixed and Augmented Reality, 2004. ISMAR 2004, pp. 110-119

[74] T. Ko, "A survey on behavior analysis in video surveillance for homeland security applications," in 37th IEEE Applied Imagery Pattern Recognition Workshop, 2008. AIPR'08. , 2008, pp. 1-8.

[75] A. Ess, K. Schindler, B. Leibe, and L. Van Gool, "Object detection and tracking for autonomous navigation in dynamic environments," The International Journal of Robotics Research, vol. 29, pp. 1707-1725 2010.

[76] P. Mistry and P. Maes, "SixthSense: a wearable gestural interface," in ACM SIGGRAPH ASIA 2009 Sketches, 2009, p. 11.

[77] G. R. Bradski, "Real time face and object tracking as a component of a perceptual user interface," in Fourth IEEE Workshop on Applications of Computer Vision (WACV'98). , 1998, pp. 214-219.

[78] Z. Zhu and Q. Ji, "Eye gaze tracking under natural head movements," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, pp. 918-923

97

Page 120: Heuristic Approach for Robust VOTprr.hec.gov.pk/.../6742/...Sciences_2015_PIEAS_ISD.pdf · Telecom Endowment fund, PIEAS. It was his gratifying attitude that sets me free from my

[79] Siemens. (21-10-11). Sistore CX EDS. Available: https://www.cee.siemens.com/web/sk/sk/priemysel/technologie-budov/katalogove-listy/Katalogy_poziarnychPriemyselna_televizia/b299.pdf

[80] L. Collins, F. Kanade, T. Duggins, and E. Tolliver, "Hasegawa. A system for video surveillance and monitoring: Vsam final report," ed: Technical Report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, 2000.

[81] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: real-time surveillance of people and their activities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 809-830, 2000.

[82] V. Kettnaker and R. Zabih, "Bayesian multi-camera surveillance," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1999, pp. 1-18.

[83] W. Hu, T. Tan, L. Wang, and S. Maybank, "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man and Cybernetics, vol. 34, pp. 334-352, August 2004.

[84] R. T. Collins, A. J. Lipton, H. Fujiyoshi, and T. Kanade, "Algorithms for cooperative multisensor surveillance," Proceedings of the IEEE, vol. 89, pp. 1456 - 1477, 2001.

[85] M. Greiffenhagen, D. Comaniciu, H. Niemann, and V. Ramesh, "Design, analysis, and engineering of video monitoring systems: an approach and a case study," Proceedings of the IEEE, vol. 89, pp. 1498 - 1517, 2001.

[86] R. Kumar, H. Sawhney, S. Samarasekera, S. Hsu, H. Tao, Y. Guo, K. Hanna, A. Pope, R. Wildes, D. Hirvonen, M. Hansen, and P. Burt, "Aerial video surveillance and exploitation," Proceedings of the IEEE, vol. 89, pp. 1518 - 1539, 2001.

[87] B. Coifman, D. Beymer, P. McLauchlan, and J. Malik, "A real-time computer vision system for vehicle tracking and traffic surveillance," Transportation Research Part C: Emerging Technologies, vol. 6, pp. 271-288, 1998.

[88] J.-C. Tai, S.-T. Tseng, C.-P. Lin, and K.-T. Song, "Real-time image tracking for automatic traffic monitoring and enforcement applications," Image and Vision Computing, vol. 22, pp. 485-501, 2004.

[89] O. Masoud and N. P. Papanikolopoulos, "A novel method for tracking and counting pedestrians in real-time using a single camera," IEEE Transactions on Vehicular Technology, vol. 50, pp. 1267-1278, 2001.

[90] N. P. Papanikolopoulos and P. K. Khosla, "Adaptive robotic visual tracking: Theory and experiments," IEEE Transactions on Automatic Control, vol. 38, pp. 429-445, 1993.

[91] Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki, and K. Fujimura, "The intelligent ASIMO: System overview and integration," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2002, pp. 2478-2483.


[92] I. F. Mondragon, P. Campoy, J. F. Correa, and L. Mejias, "Visual model feature tracking for UAV control," in IEEE International Symposium on Intelligent Signal Processing (WISP), 2007, pp. 1-6.

[93] J. Lee, R. Huang, A. Vaughn, X. Xiao, J. K. Hedrick, M. Zennaro, and R. Sengupta, "Strategies of path-planning for a UAV to track a ground vehicle," in AINS Conference, 2003.

[94] U. Handmann, T. Kalinke, C. Tzomakas, M. Werner, and W. von Seelen, "Computer vision for driver assistance systems," in International Society for Optics and Photonics: Aerospace/Defense Sensing and Controls, 1998, pp. 136-147.

[95] J. Ahmed, M. Shah, A. Miller, D. Harper, and M. N. Jafri, "A Vision-based System for a UGV to Handle a Road Intersection," in Proceedings of the National Conference on Artificial Intelligence, 2007.

[96] D. Rand, R. Kizony, and P. T. L. Weiss, "The Sony PlayStation II EyeToy: low-cost virtual reality for use in rehabilitation," Journal of Neurologic Physical Therapy, vol. 32, pp. 155-163, 2008.

[97] S. Wang, X. Xiong, Y. Xu, C. Wang, W. Zhang, X. Dai, and D. Zhang, "Face-tracking as an augmented input in video games: enhancing presence, role-playing and control," in Proceedings of the ACM SIGCHI conference on Human Factors in computing systems, 2006, pp. 1097-1106.

[98] A. Amini, R. Owen, P. Anandan, and J. Duncan, "Non-rigid motion models for tracking the left-ventricular wall," in Information Processing in Medical Imaging, 1991, pp. 343-357.

[99] M. J. M. Vasconcelos, S. M. R. Ventura, D. R. S. Freitas, and J. M. R. S. Tavares, "Using statistical deformable models to reconstruct vocal tract shape from magnetic resonance images," Proceedings of the Institution of Mechanical Engineers, Part H: Journal of Engineering in Medicine, vol. 224, pp. 1153-1163, 2010.

[100] M. J. Vasconcelos, S. M. Rua Ventura, D. R. S. Freitas, and J. M. R. S. Tavares, "Towards the automatic study of the vocal tract from magnetic resonance images," Journal of Voice, pp. 732-742, 2010.

[101] C. Stauffer and W. E. L. Grimson, "Learning patterns of activity using real-time tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 747-757, 2000.

[102] R. Bodor, B. Jackson, and N. Papanikolopoulos, "Vision-based human tracking and activity recognition," in Proc. of the 11th Mediterranean Conf. on Control and Automation, 2003.

[103] J. M. Fitts, "Precision correlation tracking via optimal weighting functions," in 18th IEEE Conference on Decision and Control including the Symposium on Adaptive Processes, 1979, pp. 280-283.


[104] K. Fukunaga and L. Hostetler, "The estimation of the gradient of a density function, with applications in pattern recognition," IEEE Transactions on Information Theory, vol. 21, pp. 32-40, 1975.

[105] Y. Cheng, "Mean shift, mode seeking, and clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, pp. 790-799, 1995.

[106] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 603-619, 2002.

[107] D. Comaniciu and P. Meer, "Robust analysis of feature spaces: color image segmentation," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, pp. 750-755.

[108] A. O. Hero III, B. Ma, O. J. J. Michel, and J. Gorman, "Applications of entropic spanning graphs," IEEE Signal Processing Magazine, vol. 19, pp. 85-95, 2002.

[109] C. Shen, M. Brooks, and A. Van Den Hengel, "Fast global kernel density mode seeking: Applications to localization and tracking," IEEE Transactions on Image Processing, vol. 16, pp. 1457-1469, 2007.

[110] R. E. Kalman and R. S. Bucy, "New results in linear filtering and prediction theory," Journal of Basic Engineering, vol. 83, pp. 95-108, 1961.

[111] E. Brookner, Tracking and Kalman Filtering Made Easy: Wiley New York, 1998.

[112] G. Welch and G. Bishop, "An introduction to the Kalman filter," TR 95-041, Department of Computer Science, University of North Carolina at Chapel Hill, 2005.

[113] M. S. Grewal and A. P. Andrews, Kalman Filtering: Theory and Practice Using MATLAB: Wiley, 2011.

[114] Y. Boykov and D. P. Huttenlocher, "Adaptive Bayesian recognition in tracking rigid objects," in IEEE Conference on Computer Vision and Pattern Recognition, 2000, pp. 697-704.

[115] D. Beymer, P. McLauchlan, B. Coifman, and J. Malik, "A real-time computer vision system for measuring traffic parameters," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, pp. 495-501.

[116] T. J. Broida and R. Chellappa, "Estimation of object motion parameters from noisy images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, pp. 90-99, 1986.

[117] D. B. Gennery, "Visual tracking of known three-dimensional objects," International Journal of Computer Vision, vol. 7, pp. 243-270, 1992.

[118] M. Isard and A. Blake, "Active contours," ed: Springer-Verlag, 1998.


[119] D. Terzopoulos and R. Szeliski, "Tracking with Kalman snakes," in Active vision, 1993, pp. 3-20.

[120] E. V. Cuevas, D. Zaldivar, and R. Rojas, Kalman filter for vision tracking: Freie Univ., Fachbereich Mathematik und Informatik, 2005.

[121] N. Peterfreund, "Robust tracking of position and velocity with Kalman snakes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, pp. 564-569, 1999.

[122] B. D. O. Anderson and J. B. Moore, Optimal Filtering: Dover Publications, 2012.

[123] A. Doucet, S. Godsill, and C. Andrieu, "On sequential Monte Carlo sampling methods for Bayesian filtering," Statistics and Computing, vol. 10, pp. 197-208, 2000.

[124] G. M. Rao and C. Satyanarayana, "Visual Object Target Tracking Using Particle Filter: A Survey," International Journal of Image, Graphics and Signal Processing, pp. 57-71, 2013.

[125] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis: Wiley New York, 1973.

[126] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed.: Prentice-Hall, Inc., 2002.

[127] C. Kuglin and D. Hines, "The Phase Correlation Image Alignment Method," in International Conference on Cybernetics and Society, 1975, pp. 163-165.

[128] J. P. Lewis, "Fast Normalized Cross-Correlation," in Vision Interface, 1995, pp. 120-123.

[129] S.-I. Chien and S.-H. Sung, "Adaptive window method with sizing vectors for reliable correlation-based target tracking," Pattern Recognition, vol. 33, pp. 237-249, 2000.

[130] R. Manduchi and G. A. Mian, "Accuracy analysis for correlation-based image registration algorithms," in IEEE International Symposium on Circuits and Systems (ISCAS'93), 1993, pp. 834-837.

[131] H. S. Stone, B. Tao, and M. McGuire, "Analysis of image registration noise due to rotationally dependent aliasing," Journal of Visual Communication and Image Representation, vol. 14, pp. 114-135, 2003.

[132] H. S. Stone, "Fourier-based image registration techniques," NEC Research, 2002.

[133] H. Foroosh, J. B. Zerubia, and M. Berthod, "Extension of phase correlation to subpixel registration," IEEE Transactions on Image Processing, vol. 11, pp. 188-200, 2002.


[134] Y. Keller, A. Averbuch, and O. Miller, "Robust Phase Correlation," in 17th International Conference on Pattern Recognition (ICPR’04), 2004, pp. 740-743.

[135] J. Ahmed and M. N. Jafri, "Improved Phase Correlation Matching," in ICISP-08: International Conference on Image and Signal Processing, France, 2008, pp. 128-135.

[136] S. Blackman and R. Popoli, Design and Analysis of Modern Tracking Systems. Boston: Artech House, 1999.

[137] M. Nixon and A. Aguado, Feature Extraction and Image Processing: Newnes, Oxford, 2002.

[138] M. Asgarizadeh and H. Pourghassem, "A robust object tracking synthetic structure using regional mutual information and edge correlation-based tracking algorithm in aerial surveillance application," Signal, Image and Video Processing, pp. 1-15, 2013.

[139] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 780-785, 1997.

[140] W. E. L. Grimson, C. Stauffer, R. Romano, and L. Lee, "Using adaptive tracking to classify and monitor activities in a site," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1998, pp. 22-29.

[141] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1999.

[142] P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection," in Video-Based Surveillance Systems, ed: Springer, 2002, pp. 135-144.

[143] T. Horprasert, D. Harwood, and L. S. Davis, "A robust background subtraction and shadow detection," in Asian Conference on Computer Vision, 2000, pp. 983-988.

[144] T. Horprasert, D. Harwood, and L. S. Davis, "A statistical approach for real-time robust background subtraction and shadow detection," in International Conference on Computer Vision, 1999, pp. 1-19.

[145] N. M. Oliver, B. Rosario, and A. P. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 831-843, 2000.

[146] A. J. Lipton, H. Fujiyoshi, and R. S. Patil, "Moving target classification and tracking from real-time video," in Fourth IEEE Workshop on Applications of Computer Vision (WACV'98), 1998, pp. 8-14.


[147] D. J. Dailey, F. W. Cathey, and S. Pumrin, "An algorithm to estimate mean traffic speed using uncalibrated cameras," IEEE Transactions on Intelligent Transportation Systems, vol. 1, pp. 98-107, 2000.

[148] D. J. Dailey and L. Li, "An algorithm to estimate vehicle speed using uncalibrated cameras," in IEEE/IEEJ/JSAI International Conference on Intelligent Transportation Systems, 1999, pp. 441-446.

[149] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185-203, 1981.

[150] M. J. Black and P. Anandan, "The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields," Computer Vision and Image Understanding, vol. 63, pp. 75-104, 1996.

[151] R. Szeliski and J. Coughlan, "Spline-based image registration," International Journal of Computer Vision, vol. 22, pp. 199-218, 1997.

[152] J. Shi and C. Tomasi, "Good features to track," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'94), 1994, pp. 593-600.

[153] K. Rangarajan and M. Shah, "Establishing motion correspondence," CVGIP: image understanding, vol. 54, pp. 56-73, 1991.

[154] C. P. Papageorgiou, M. Oren, and T. Poggio, "A general framework for object detection," in IEEE Sixth International Conference on Computer Vision, 1998, pp. 555-562.

[155] D. Cremers and C. Schnörr, "Statistical shape knowledge in variational motion segmentation," Image and Vision Computing, vol. 21, pp. 77-86, 2003.

[156] B. Li, R. Chellappa, Q. Zheng, and S. Z. Der, "Model-based temporal object verification using video," IEEE Transactions on Image Processing, vol. 10, pp. 897-908, 2001.

[157] M. Bertalmío, G. Sapiro, and G. Randall, "Morphing active contours," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 733-737, 2000.

[158] A. R. Mansouri, "Region tracking via level set PDEs without motion computation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 947-961, 2002.

[159] X. Liu and T. Yu, "Gradient feature selection for online boosting," in IEEE 11th International Conference on Computer Vision (ICCV), 2007, pp. 1-8.

[160] S. Avidan, "Ensemble tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, pp. 261-271, 2007.


[161] J. Wang, X. Chen, and W. Gao, "Online selecting discriminative tracking features using particle filter," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, 2005, pp. 1037-1042.

[162] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms: Wiley, 2004.

[163] C. M. Bishop, Pattern Recognition and Machine Learning: Springer New York, 2006.

[164] A. Saffari, C. Leistner, M. Godec, and H. Bischof, "Robust multi-view boosting with priors," in Computer Vision–ECCV 2010, ed: Springer, 2010, pp. 776-789.

[165] C. Leistner, A. Saffari, P. M. Roth, and H. Bischof, "On robustness of on-line boosting - a competitive study," in IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), 2009, pp. 1362-1369.

[166] H. Masnadi-Shirazi, V. Mahadevan, and N. Vasconcelos, "On the design of robust classifiers for computer vision," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 779-786.

[167] O. Williams, A. Blake, and R. Cipolla, "A sparse probabilistic learning algorithm for real-time tracking," in Ninth IEEE International Conference on Computer Vision, 2003, pp. 353-360.

[168] J. Kennedy and R. Eberhart, "Particle Swarm Optimization," in IEEE International Conference on Neural Networks, Piscataway, 1995, pp. 1942-1948.

[169] R. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the Sixth International Symposium on Micro Machine and Human Science (MHS'95), 1995, pp. 39-43.

[170] R. Poli, "Analysis of the publications on the applications of particle swarm optimisation," Journal of Artificial Evolution and Applications, 2008.

[171] M. Clerc and J. Kennedy, "The particle swarm - explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, pp. 58-73, 2002.

[172] M. P. Wachowiak, R. Smolíková, Y. Zheng, J. M. Zurada, and A. S. Elmaghraby, "An approach to multimodal biomedical image registration utilizing particle swarm optimization," IEEE Transactions on Evolutionary Computation, vol. 8, pp. 289-301, 2004.

[173] A. P. Engelbrecht, Computational Intelligence: An Introduction: Wiley, 2007.

[174] D. Sedighizadeh and E. Masehian, "Particle swarm optimization methods, taxonomy and applications," International Journal of Computer Theory and Engineering, vol. 1, pp. 486-502, 2009.


[175] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, pp. 1289-1306, 2006.

[176] E. J. Candes, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, pp. 1207-1223, 2006.

[177] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, "Sparse representation for computer vision and pattern recognition," Proceedings of the IEEE, vol. 98, pp. 1031-1044, 2010.

[178] G. Sapiro, J. Mairal, J. Wright, Y. Ma, T. Huang, and S. Yan, "Sparse Representation for Computer Vision and Pattern Recognition," Technical report, Institute for Mathematics and Its Applications, University of Minnesota, Minneapolis, 2009.

[179] J. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Transactions on Image Processing, vol. 19, pp. 2861-2873, 2010.

[180] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 210-227, 2009.

[181] X. Mei, "Visual Tracking and Illumination Recovery Via Sparse Representation," 2009.

[182] X. Mei and H. Ling, "Robust visual tracking and vehicle classification via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 2259-2272, 2011.

[183] B. Liu, L. Yang, J. Huang, P. Meer, L. Gong, and C. Kulikowski, "Robust and fast collaborative tracking with two stage sparse optimization," in Computer Vision–ECCV 2010, ed: Springer, 2010, pp. 624-637.

[184] B. Liu, J. Huang, L. Yang, and C. Kulikowsk, "Robust tracking using local sparse appearance model and k-selection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1313-1320.

[185] S. Zhang, H. Yao, X. Sun, and X. Lu, "Sparse coding based visual tracking: Review and experimental comparison," Pattern Recognition, 2013.

[186] A. Oliva and A. Torralba, "The role of context in object recognition," Trends in cognitive sciences, vol. 11, pp. 520-527, 2007.

[187] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert, "An empirical study of context in object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1271-1278.

[188] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in Computer Vision–ECCV 2012, ed: Springer, 2012, pp. 702-715.


[189] Y. Wu, J. Lim, and M.-H. Yang, "Online Object Tracking: A Benchmark," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[190] S. Wong, "Advanced Correlation Tracking of Objects in Cluttered Imagery," in International Society for Optics and Photonics: Defense and Security, 2005, pp. 158-169.

[191] J. Ahmed, A. Ali, and A. Khan, "Stabilized active camera tracking system," Journal of Real-Time Image Processing, pp. 1-20, 2012.

[192] W. Wang, T. Adalı, and D. Emge, "A Novel Approach for Target Detection and Classification Using Canonical Correlation Analysis," Journal of Signal Processing Systems, pp. 1-12, 2012.

[193] A. Ali, H. Kauser, and M. I. Khan, "Automatic Visual Tracking and Firing System for Anti-Aircraft Machine Gun," in 6th International Bhurban Conference on Applied Sciences and Technology (IBCAST), Islamabad, Pakistan, 2009, pp. 253-257.

[194] F. Bousetouane, L. Dib, and H. Snoussi, "Improved mean shift integrating texture and color features for robust real time object tracking," The Visual Computer, pp. 1-16, 2013.

[195] M. Asgarizadeh, H. Pourghassem, G. Shahgholian, and H. Soleimani, "Robust and real time object tracking using regional mutual information in surveillance and reconnaissance systems," in 7th IEEE Iranian Machine Vision and Image Processing (MVIP) Conference, 2011, pp. 1-5.

[196] R. L. Brunson, D. L. Boesen, G. A. Crockett, and J. F. Riker, "Precision trackpoint control via correlation track referenced to simulated imagery," in International Society for Optics and Photonics: Aerospace Sensing, 1992, pp. 325-336.

[197] Available at: http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml.

[198] Available at: http://www.cs.technion.ac.il/~amita/fragtrack/fragtrack.html.

[199] J. N. Wilson and G. X. Ritter, Handbook of Computer Vision–Algorithms in Image Algebra: CRC Press, 2001.

[200] Q. Chen, M. Defrise, and F. Deconinck, "Symmetric phase-only matched filtering of Fourier-Mellin transforms for image registration and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, pp. 1156-1168, 1994.

[201] J. Jingying, H. Xiaodong, X. Kexin, and Y. Qilian, "Phase Correlation-based Matching Method with Sub-pixel Accuracy for Translated and Rotated Images," in IEEE International Conference on Signal Processing (ICSP'02), 2002, pp. 752-755.

[202] J. Ahmed, "Adaptive Edge-Enhanced Correlation Based Robust and Real-Time Visual Tracking Framework and Its Deployment in Machine Vision Systems," Ph.D. dissertation, Department of Electrical Engineering, National University of Science and Technology (NUST), Rawalpindi, Pakistan, 2008.

[203] R. C. Gonzalez, R. E. Woods, and S. L. Eddins, Digital Image Processing Using MATLAB: Pearson Education Pte. Ltd., 2004.

[204] S. Sutor, R. Röhr, G. Pujolle, and R. Reda, "Efficient Mean Shift Clustering using Exponential Integral Kernels," International Journal of Electrical and Computer Engineering, vol. 4, pp. 206-210, 2009.

[205] A. Yilmaz, K. Shafique, N. Lobo, X. Li, T. Olson, and M. Shah, "Target tracking in FLIR imagery using mean shift and global motion compensation," in IEEE Workshop on Computer Vision Beyond Visible Spectrum, Kauai, Hawaii, 2001, pp. 54-58.

[206] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," in IEEE Conf. on Computer Vision and Pattern Recognition, 2000, pp. 142-149.

[207] R. T. Collins, "Mean-shift blob tracking through scale space," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003, pp. 234-240.

[208] J. Ahmed and M. N. Jafri, "Best-Match Rectangle Adjustment Algorithm for Persistent and Precise Correlation Tracking," in IEEE International Conference on Machine Vision (ICMV), Islamabad, Pakistan, 2007.

[209] S. Oron, A. Bar-Hillel, D. Levi, and S. Avidan, "Locally Orderless Tracking," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1940-1947.

[210] Available at : http://gpu4vision.icg.tugraz.at/index.php?content=subsites/prost/prost.php.

[211] Available at: http://groups.inf.ed.ac.uk/vision/caviar/caviardata1/.

[212] Available at: http://cv.snu.ac.kr/research/~vtd/.

[213] Available at: http://www.cs.toronto.edu/~dross/ivt/.

[214] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," International Journal of Computer Vision, vol. 88, pp. 303-338, 2010.

[215] Z. Kalal, J. Matas, and K. Mikolajczyk, "P-N learning: Bootstrapping binary classifiers by structural constraints," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 49-56.

[216] Y. Wang and Q. Zhao, "Robust object tracking via online Principal Component–Canonical Correlation Analysis (P3CA)," Signal, Image and Video Processing, pp. 1-16, 2013.


[217] B. C. Kuo and M. F. Golnaraghi, Automatic control systems vol. 4: John Wiley & Sons New York, 2003.

[218] E. V. Cuevas, D. Zaldivar, and R. Rojas, Intelligent tracking: Freie Univ., Fachbereich Mathematik und Informatik, 2003.

[219] T. J. Ross, Fuzzy logic with engineering applications: John Wiley & Sons, 2009.

[220] H. T. Nguyen and E. A. Walker, A first course in fuzzy logic: CRC press, 2005.

[221] N. Mir-Nasiri, "Camera-based 3D object tracking and following mobile robot," in IEEE Conference on Robotics, Automation and Mechatronics, 2006, pp. 1-6.

[222] "Basic Control Law for PTU to Follow a Moving Target," Application Note 01, Directed Perception Inc., 1996.

[223] E. Vermeulen, "Real-time video stabilization for moving platforms," in 21st Bristol UAV Systems Conference, 2007, p. 3.

[224] M. Tico and M. Vehvilainen, "Robust method of video stabilization," in EUSIPCO-07: European Signal and Image Processing Conference, 2007.

[225] R. Hu, R. Shi, I.-F. Shen, and W. Chen, "Video stabilization using scale-invariant features," in 11th International Conference on Information Visualization (IV'07), 2007, pp. 871-877.

[226] S. Battiato, G. Gallo, G. Puglisi, and S. Scellato, "Fuzzy-based motion estimation for video stabilization using SIFT interest points," in IS&T/SPIE Electronic Imaging, 2009, pp. 72500T-72500T-8.

[227] Canon Inc., "Canon FAQ: What is vari-angle prism?" Accessed 15-03-2014. Available: http://www.canon.com/bctv/faq/vari.html

[228] H.-C. Chang, S.-H. Lai, and K.-R. Lu, "A robust and efficient video stabilization algorithm," in IEEE International Conference on Multimedia and Expo (ICME'04), 2004, pp. 29-32.

[229] Accessed 15-03-2014. Available: http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/OWENS/LECT12/node4.html.

[230] C. Buehler, M. Bosse, and L. McMillan, "Non-metric image-based rendering for video stabilization," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001, pp. II-609-II-614.

[231] S. Auberger and C. Miro, "Digital video stabilization architecture for low cost devices," in Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (ISPA 2005), 2005, pp. 474-479.


[232] S.-W. Jang, M. Pomplun, G.-Y. Kim, and H.-I. Choi, "Adaptive robust estimation of affine parameters from block motion vectors," Image and Vision Computing, vol. 23, pp. 1250-1263, 2005.

[233] F. Vella, A. Castorina, M. Mancuso, and G. Messina, "Digital image stabilization by adaptive block motion vectors filtering," IEEE Transactions on Consumer Electronics, vol. 48, pp. 796-801, 2002.
