© 2006 by Amit Sethi. All rights reserved.
INTERACTION BETWEEN MODULES IN LEARNING SYSTEMS FOR VISION APPLICATIONS
BY
AMIT SETHI
B.Tech., Indian Institute of Technology, New Delhi, 1999
M.S., University of Illinois at Urbana-Champaign, 2001
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2006
Urbana, Illinois
ABSTRACT
Complex vision tasks such as event detection in a surveillance video can be divided into subtasks
such as human detection, tracking, recognition, and trajectory analysis. The video can be thought of
as being composed of various features. These features can be roughly arranged in a hierarchy from
low-level features to high-level features. Low-level features include edges and blobs, and high-level
features include objects and events. Loosely, the low-level feature extraction is based on signal/image
processing techniques, while the high-level feature extraction is based on machine learning techniques.
Traditionally, vision systems extract features in a feedforward manner on the hierarchy; that is,
certain modules extract low-level features and other modules make use of these low-level features to
extract high-level features. Along with others in the research community, we have worked on this design
approach. We briefly present our work on object recognition and multiperson tracking systems designed
with this approach and highlight its advantages and shortcomings. However, our focus is on system
design methods that allow tight feedback between the layers of the feature hierarchy, as well as among
the high-level modules themselves. We present previous research on systems with feedback and discuss
the strengths and limitations of these approaches. This analysis allows us to develop a new framework
for designing complex vision systems that allows tight feedback in a hierarchy of features and modules
that extract these features using a graphical representation. This new framework is based on factor
graphs. It relaxes some of the constraints of traditional factor graphs and replaces their function nodes with modified versions of modules that have been developed for specific vision tasks. Such modules can be formulated easily by slightly modifying modules developed for specific tasks in other vision systems, provided we can match their input and output variables to variables in our graphical structure. The framework also draws inspiration from products of experts and the free-energy view of the EM algorithm. We present
experimental results and discuss the path for future development.
To my parents
ACKNOWLEDGMENTS
I thank Professor Thomas S. Huang for the invaluable guidance, encouragement, and inspiration that
he has given me over the course of my studies. He has been helpful, understanding, and patient during
the tough times to bring this work to fruition. He knows how to provide a nurturing environment to his
students. I thank Professor David J. Kriegman at the University of California, San Diego, for introducing
me to computer vision and machine learning and for his support during the early part of my graduate
studies. I also thank the rest of my doctoral committee members, Professors Narendra Ahuja, Robert
M. Fossum, and Yi Ma, for their advice and support.
I thank my current and former colleagues for their camaraderie and research inputs, especially Man-
dar Rahurkar, Aleksandar Ivanovic, Dr. Nemanja Petrovic, Dr. Ashutosh Garg, Mithun Das Gupta,
Shyamsundar Rajaram, Cagrı K. Daglı, Jilin Tu, Yue Zhou, Maha El Choubassi, and Dennis Lin. I also
thank Professor Brendan J. Frey at the University of Toronto and Professor Lester Loschky at Kansas State
University for the fruitful collaboration with them.
I thank my friends from outside the research realm, especially Dr. Rajesh Kumar, Dr. Murali Manoj,
Zakia Khan, Rekha Santhanam, Soumya Jana, and Dr. Balakumar Balasubramaniam for being close and
supportive friends and for the numerous discussions toward finding meaning in life. I also thank Sachin
Sharma, Nitin Kumar, Srinivasan Rajagopal, Gaurav Gupta, Anurag Sethi, Sunita Singh, Parag Bhide,
Dr. Vaibhav Donde, Vijay Thakur, and Kirti Joshi for their support and encouragement.
Finally, I want to thank my parents Mrs. Vyjayanti Sethi and Col. Anand M. Sethi, and brothers
Anuj and Gautam for their love, support, and encouragement throughout my life, which helped me reach
where I am today.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
  1.1 Nature of Visual Data
    1.1.1 Constraints on pixels in visual data
    1.1.2 Hierarchical representation of variables from visual data
  1.2 Machine Learning and Video Understanding
  1.3 Original Contributions
  1.4 Overview

CHAPTER 2 MODULAR FEEDFORWARD VISION SYSTEMS
  2.1 Object Recognition
    2.1.1 Scenario
    2.1.2 Theory: feature mapping and matching
    2.1.3 Apparatus
    2.1.4 Object modeling
    2.1.5 Object recognition
    2.1.6 Results and discussion
  2.2 Multiple Object Tracking and Event Detection
    2.2.1 Human detection
    2.2.2 Multiple human tracking algorithm
    2.2.3 Tracking results
    2.2.4 Event detection
  2.3 Discussion

CHAPTER 3 VISION SYSTEMS WITH FEEDBACK AND GENERATIVE MODELS
  3.1 Related Work
    3.1.1 Connectionist models
    3.1.2 Information-theoretic models
    3.1.3 Generative models
      3.1.3.1 Probabilistic graphical models
      3.1.3.2 Models related to pattern theory and generative modeling
    3.1.4 Comparison of connectionist, information-theoretic, and generative models
  3.2 Application: Multimodal Person Tracking
    3.2.1 Algorithm
      3.2.1.1 Time delay of arrival estimation using audio signals
      3.2.1.2 Visual tracker
      3.2.1.3 Multimodal object tracking
    3.2.2 Results
  3.3 Discussion

CHAPTER 4 BACKGROUND FOR VARIABLE/MODULE GRAPHS
  4.1 Differences between Feedforward Modular and Generative Design
  4.2 A Unifying View
    4.2.1 Variables
    4.2.2 Constraints on variables
      4.2.2.1 Modeling constraints as probabilities
  4.3 Probability Density Modeling
    4.3.1 Mixture form
    4.3.2 Product form
    4.3.3 Probabilistic graphical models with product form
      4.3.3.1 Factor graphs
      4.3.3.2 Bayesian networks
    4.3.4 Discussion
      4.3.4.1 Advantages of product form
      4.3.4.2 Limitations of probabilistic graphical models

CHAPTER 5 VARIABLE/MODULE GRAPHS: FACTOR GRAPHS WITH MODULES
  5.1 Replacing Functions in Factor Graphs with Modules
  5.2 System Modeling using V/M Graphs and its Relation to the Product Form
  5.3 Inference
  5.4 Learning
  5.5 Free-Energy View of EM Algorithm and V/M Graphs
    5.5.1 Online and local E-step
    5.5.2 Online and local M-step
    5.5.3 PDF softening
  5.6 Prescription

CHAPTER 6 SOME APPLICATIONS OF V/M GRAPHS
  6.1 Scenario
  6.2 Application: Person Tracking
    6.2.1 Message passing and learning schedule
    6.2.2 Results
  6.3 Application: Multiperson Tracking
    6.3.1 Results
  6.4 Application: Trajectory Modeling and Prediction
    6.4.1 Trajectory modeling module
    6.4.2 Results
  6.5 Application: Event Detection Based on Single Target
    6.5.1 Results
  6.6 Application: Event Detection Based on Multiple Targets
    6.6.1 Results

CHAPTER 7 CONCLUSIONS AND FUTURE WORK
  7.1 Conclusions
  7.2 Future Work

REFERENCES

AUTHOR'S BIOGRAPHY
CHAPTER 1
INTRODUCTION
It is estimated that over 80% of all information around us is captured by our sight in the form of
visual data. Thus, it is not surprising that since the advent of photography, video capture, and television,
images and video have become increasingly important media for capturing, exchanging, and storing information. Today's digital technology and the affordability of image and video capture and storage devices have led to a visual information boom. Automatic methods of processing visual information
have become necessary to deal with this information explosion. Image processing, image compression,
video processing, video compression, image understanding, and computer vision have become impor-
tant research fields that are stepping up to the challenge of automatic processing and handling of visual
information.
There has been tremendous progress in research and development in the fields of image and video
compression, editing, and analysis software, leading to their effective usability and commercialization.
However, success in developing general methods of analyzing video in a wide range of scenarios remains
elusive. The main reason is the large number of scenario-dependent parameters that affect pixels within a video and across videos. Moreover, the sheer volume of raw data in video streams compounds the difficulty.
Yet the problem of image or video understanding is often ill-posed, making it even more difficult to solve from the given data alone. It is, therefore, important to understand the nature
of the generation of the visual data itself. It is also important to understand the features of visual data
that human users would be interested in, and how those features might be extracted. How these features
are related to each other and how the modules extracting them might interact with each other will be of
special interest in designing vision systems.
The human visual system is a ready example of a system that works successfully in extracting the
features of interest. Thus, studies in neuroscience and human visual perception have also deeply affected
the design of automatic computer vision and video understanding systems. However, the human brain and its visual pathway are still far from fully understood.
1.1 Nature of Visual Data
A digital video is a discrete version of a natural light signal captured by a camera. For a video of t frames, where each frame is m rows by n columns of pixels with three channels per pixel, the video lies in a raw pixel space of 3mnt dimensions. Assuming 8 bits per channel, there are 2^(24mnt) possible videos, or points, in this space. Not all points in this discrete space represent plausible, naturally generated videos. This means that actual videos lie on a low-dimensional subspace of the raw pixel space. Many constraints define this subspace.
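As a concrete illustration of this dimensionality argument, the following sketch (our own, not part of the dissertation) computes the dimension 3mnt of the raw pixel space and the exponent in the 2^(24mnt) count of representable videos:

```python
# Illustrative sketch: size of the raw pixel space of a digital video.
def raw_video_space(m, n, t, channels=3, bits_per_channel=8):
    """Return (dimensions, log2 of the number of representable videos)."""
    dims = channels * m * n * t            # 3mnt dimensions
    log2_points = bits_per_channel * dims  # 2^(24mnt) points at 8 bits/channel
    return dims, log2_points

# Even a 1-second, 30-frame, 64x64 clip is astronomically high-dimensional:
dims, log2_points = raw_video_space(64, 64, 30)
print(dims)         # 368640 dimensions
print(log2_points)  # i.e., 2^2949120 representable videos
```

The overwhelming majority of these points are noise; naturally generated videos occupy a vanishingly small subspace, which is what makes the constraints on visual data learnable.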
1.1.1 Constraints on pixels in visual data
Pixels in a frame represent light bouncing off the surfaces of various objects. Objects have finite dimensions and predictable appearance. This predictability is sometimes called spatial coherence in video. Moreover, due to the laws of physics, objects have limited accelerations and speeds. For example, if we know the world-record human speeds for various activities, we can expect the humans in a given video to be slower than those records. This predictability over time is known as temporal coherence in video.
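Temporal coherence can serve as a hard plausibility check on candidate interpretations of a video. The following hypothetical sketch (the speed bound, function name, and track format are our own assumptions, not the dissertation's) rejects a human track whose implied speed is physically implausible:

```python
# Illustrative: reject tracks whose frame-to-frame speed exceeds a bound.
MAX_HUMAN_SPEED_MPS = 12.4  # assumed bound, roughly a world-record sprint

def plausible_track(positions_m, fps):
    """positions_m: per-frame (x, y) world coordinates in meters."""
    dt = 1.0 / fps
    for (x0, y0), (x1, y1) in zip(positions_m, positions_m[1:]):
        speed = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / dt
        if speed > MAX_HUMAN_SPEED_MPS:
            return False  # temporally incoherent: no human moves this fast
    return True

print(plausible_track([(0, 0), (0.3, 0), (0.6, 0)], fps=30))  # True  (~9 m/s)
print(plausible_track([(0, 0), (5.0, 0)], fps=30))            # False (150 m/s)
```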
With an object- and event-centric view of visual signals, it is easy to describe how an image or a video was generated. Roughly speaking, objects are distinct contiguous entities in space that constitute an image or a frame. In a video, the state of these objects is usually a function of time. For example, the pose or position of an object might change with time. Video events are usually defined as functions of the states, or time series of states, of the objects.
Objects also have motion constraints that limit the shapes they can assume. For example, humans are constrained in how their joints bend. Depending on the context, even the types of objects (including humans) and events (including human activities) encountered are limited.
For example, in a scenario where a fixed camera is monitoring an indoor cafe, we do not expect to see
crocodiles or airplanes as objects, or volcanic eruptions as events. Thus, many constraints lead to predictability in videos, and this predictability can be learned and interpreted.
1.1.2 Hierarchical representation of variables from visual data
Humans naturally describe images and videos as compositions of objects and events, as opposed to
being a composition of pixels as they are often captured and stored in machines. Moreover, humans
would also describe certain features, such as individual pixel values and edges, as being caused by the objects and events present in the video. This reasoning about causality follows a certain path. For example, the presence of a foreground object against a background object causes the colors we see at certain pixel locations and the edges we see at the boundaries of the objects. This cause-and-effect relationship between features leads to a hierarchy in which the causes are usually the high-level entities. This interest in high-level features is a natural part of visual cognition and is necessary for our daily lives and survival. For example, we are far more interested in knowing whether we are in the path of a speeding car than in the dominant color of the scene.
Thus, some of the most challenging and interesting tasks that we want computer vision systems to
perform are detection and recognition of these high-level features such as objects, their states, and events
in images and videos. For example, in a surveillance video, we would like to detect any suspicious
activity by the humans under surveillance. Such suspicious activity can be termed an unusual event.
However, these high-level features are difficult to define mathematically and algorithmically. This also
makes their extraction more difficult. In many cases, we also want to extract these features in a manner that is invariant to many other factors. For example, if we want to determine whether the face of a particular person is present in an image, we want this determination to be invariant to lighting conditions, the location of the face, its pose, attire, facial hair, and so on. Due
to such challenges, face detection is still an active research problem. On the other hand, extracting a
corner, without any context, just depends on the local distribution of intensities, and can be defined
Table 1.1 Characteristics and examples of the low-level and high-level features that represent video data.

Low-Level Features                                    High-Level Features
Effect of high-level features                         Cause of low-level features
Semantically less meaningful                          Semantically more meaningful
Components of a high-dimensional representation       Components of a low-dimensional representation
Easy to define                                        Difficult to define
Easy to extract                                       Difficult to extract
Weaker requirement of invariant extraction            Stronger requirement of invariant extraction
Weaker notion of invariance                           Stronger notion of invariance
E.g., edges, color histogram, corners, optical flow   E.g., face (yes/no), walking
mathematically in a rather simple way (such as a high second eigenvalue of the correlation matrix of
image intensities in the neighborhood of the potential corner).
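The corner definition just given can be sketched directly: form the correlation matrix (structure tensor) of the intensity gradients over a neighborhood and test its second, i.e., smaller, eigenvalue. This is a minimal Shi-Tomasi-style illustration of our own; the patch size and test patterns are arbitrary choices, not the dissertation's:

```python
import numpy as np

def corner_response(patch):
    """Smallest eigenvalue of the structure tensor of a grayscale patch."""
    gy, gx = np.gradient(patch.astype(float))  # image gradients
    # 2x2 correlation matrix of gradients, summed over the neighborhood
    M = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(M)[0]  # eigenvalues in ascending order

flat = np.ones((7, 7))                           # no gradients at all
corner = np.zeros((7, 7)); corner[3:, 3:] = 1.0  # two perpendicular edges meet
print(corner_response(flat))                     # 0.0: not a corner
print(corner_response(corner) > 0.0)             # True: second eigenvalue positive
```

An edge alone leaves the second eigenvalue near zero; only two non-parallel gradient directions in the window push it up, which is exactly why the definition needs no context beyond the local intensity distribution.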
Table 1.1 summarizes the characteristics and examples of low-level and high-level features in video
data.
Much of the research in computer vision aims to bridge the semantic gap between the low-level representation of the data and the high-level semantic concepts (or features) that we are usually interested in. In a generative model of the image or video, variables that cannot be observed (usually, only the pixels can be observed) are known as hidden variables. Variables that are irrelevant to the interpretation of the desired concept, yet affect the observed data, are usually a subset of the hidden variables.
While deciding on the possible outcomes of the hidden variables that we are interested in, we need to be
able to take into account the possible values of the other hidden variables and their probabilities. Due
to the difficulty of obtaining reliable functions that extract high-level concepts in a manner that is invariant to the values of the often numerous irrelevant hidden variables, we need to rely on
adaptable functions that can make use of previous examples. This is where machine learning finds use
in computer vision applications.
1.2 Machine Learning and Video Understanding
Machine learning allows one to leave some function parameters to be estimated from the data itself
while estimating the function that maps points from one space to another (such as from the data space to
the feature space). When the estimation process makes use of labeled pairs of associated points from the
input and output spaces, it is known as supervised learning. The value in the output space is also called the label of the point in the input space. When the estimation process is required to find these labels on its own, as well as learn the function parameters, it is known as unsupervised learning, which includes data clustering (dividing data points into clusters based on some measure of closeness). Using some labeled points and some unlabeled points makes the technique semisupervised.
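The clustering just mentioned can be illustrated with a toy one-dimensional two-cluster k-means, in which the labels are invented by the algorithm rather than supplied (the data and initialization here are our own illustrative choices):

```python
import numpy as np

def kmeans_1d(xs, iters=10):
    """Unsupervised: assign each point to the nearer of two learned centers."""
    centers = np.array([min(xs), max(xs)], dtype=float)  # crude initialization
    labels = np.zeros(len(xs), dtype=int)
    for _ in range(iters):
        # assignment step: label each point with its nearest center
        labels = np.array([0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
                           for x in xs])
        # update step: move each center to the mean of its members
        for k in (0, 1):
            members = [x for x, l in zip(xs, labels) if l == k]
            if members:
                centers[k] = np.mean(members)
    return labels

print(kmeans_1d([0.1, 0.2, 0.15, 5.0, 5.2, 4.9]))  # [0 0 0 1 1 1]
```

A supervised learner would instead be handed the labels and only fit the mapping; here the grouping itself is the output, and its usefulness for a downstream vision task is exactly the hard-to-predict part noted below.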
There are two distinct schools of thought in machine learning, especially when applied to high-level
feature extraction in computer vision: discriminative and generative. In the discriminative approach, classification and recognition systems are built bottom-up, starting with low-level features in the hierarchy and extracting higher and higher levels of features or concepts. The design is concerned with accurate mapping between the input and output spaces, without any explicit emphasis on how a label from the output space could cause the generation of the data point in the input space. Usually, supervised methods are necessary to train discriminative approaches that provide a meaningful output for image and video processing, since it is difficult to predict how a given clustering criterion will affect the final outcome in a complex vision task.
On the other hand, it is hypothesized that humans infer the state of the world around them by
matching and validating the input signals against a model of the world that is already in their mind [1].
This means that humans tweak a model of how the data were generated, in a top-down fashion, to validate the observed data. Models that describe the process by which the data were generated belong to the second school of thought in machine learning and are known as generative models. Generative models naturally describe how high-level concepts give rise to the observed low-level data. Hidden variables can also be incorporated into a generative model in a principled manner. Probabilistic graphical models [2] are tools that help describe such models graphically and the inherent uncertainty within them probabilistically. The generative approach is also closely related to the human description of cause-and-effect relationships between various features of the data.
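As a minimal illustration of the generative viewpoint (entirely our own; the class names, priors, and Gaussian likelihoods are assumed for the example), a hidden cause H generates an observation X, and inference inverts this generative process with Bayes' rule instead of learning a direct input-to-output mapping:

```python
import math

priors = {"face": 0.5, "background": 0.5}   # p(H): assumed prior over causes
means  = {"face": 1.0, "background": -1.0}  # p(X | H): assumed Gaussian means

def posterior(x, sigma=1.0):
    """p(H | X = x): invert the generative model with Bayes' rule."""
    lik = {h: math.exp(-((x - m) ** 2) / (2 * sigma ** 2))
           for h, m in means.items()}
    z = sum(priors[h] * lik[h] for h in priors)  # p(X = x), the evidence
    return {h: priors[h] * lik[h] / z for h in priors}

p = posterior(0.8)
print(p["face"] > p["background"])  # True: 0.8 is nearer the "face" mean
```

The model is specified entirely in the causal direction, from hidden class to observation; recognition falls out as probabilistic inversion, which is what makes incorporating further hidden variables principled rather than ad hoc.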
Another related field that aims to simplify (or compress) the description of data and make its inter-
pretation robust to noise is information theory. There seems to be a direct intuitive relation between an
efficient representation (or compression) of the data and the description of the generative process. Thus,
we are also witnessing an increased application of information theoretic concepts to machine learning.
1.3 Original Contributions
In this work, we explore the two prevalent frameworks for designing computer vision systems.
We design and implement computer vision systems based on these two frameworks, study systems implemented by others, analyze their performance, and critically compare their advantages and disadvantages.
We also develop a new framework that combines the advantages of the two existing frameworks. We explore its theoretical underpinnings in some depth. We design
computer vision systems based on this new framework and analyze the results of our experiments. We
present suggestions for further exploration and improvement.
1.4 Overview
We study the frameworks to design complex vision systems to extract high-level features based on
machine learning. Scenarios of interest are object recognition, multitarget tracking, and event detection in surveillance videos. We start with the more prevalent approach to designing vision systems for these tasks, the feedforward design, and move toward the more intuitive generative approach, addressing some of the issues that generative approaches face. We then design a new framework that
combines the advantages of both of these approaches.
In Chapter 2, we present our work on the feedforward way of assembling feature extraction mod-
ules in a hierarchy to design a vision system for a given task. We present the results and discuss the
limitations of these methods. We also discuss why these approaches are so popular.
In Chapter 3, we present how some of the existing techniques address the issue of feedback among
various modules or units in a vision system. We present some of our work on ad hoc feedback design.
We discuss the strengths and limitations of these techniques.
In Chapter 4, we compare the two frameworks critically and present a view to see them from a
common viewpoint. We lay further foundation to seamlessly borrow ideas from either framework in
order to design a new hybrid framework called V/M graphs.
In Chapter 5 we develop the V/M graphs framework that can be viewed as a hybrid between the
discriminative approach and generative models. The framework uses the known modules that work
well for different vision tasks, which may have been designed using a discriminative approach. These
modules can be modified and fit into a complex generative model to simplify computation.
In Chapter 6, we present some applications and results using the V/M graphs framework. We highlight the qualities that the new framework borrows from the two existing frameworks.
We conclude our work in Chapter 7. We present the future work needed to establish whether the new framework can become a useful tool for designing complex vision systems, suggest directions for further developing this line of research, and note applications that could grow out of this work.
CHAPTER 2
MODULAR FEEDFORWARD VISION SYSTEMS
Complex tasks such as object recognition, multitarget tracking, and event detection have attracted
a lot of attention in the computer vision and video understanding community. Most of the systems for
these tasks work in a feedforward manner. Image processing and other techniques extract low-level
features, usually in the form of a feature map of the image or the frame. Based on these low-level
features, the object of interest is segmented. Attributes, which are again low-level or mid-level features, are extracted for the object. These features are compared to a model of the object for object recognition, or to a model of an event for event recognition. There are many techniques in object recognition and event
detection that fit this description. In this chapter we present a brief review and our work on vision
systems without feedback between modules.
2.1 Object Recognition
Object recognition is a high-level task. Most object recognition systems are based on the assumption
that the object can be accurately segmented from the image. Based on the segmentation results, certain
features are extracted. The extracted features are matched against a model of the object, which, in turn,
is based on these features. The models are usually formed in a supervised manner by presenting some exemplars of the object to a modeling system. Some examples from the vast number of systems that fit
this approach are [3–5].
2.1.1 Scenario
Based on the assumption that the object can be segmented from the image, in our work on object
recognition [6], we addressed the case where the only reliable information that can be extracted from
the image of the object is its silhouette. This is true when the object is bounded by a smooth textureless
surface. We further assumed the weak-perspective (or scaled orthographic) case where the rays from
the object are (approximately) parallel when they hit the image plane. The concept of the silhouette
formation from the occluding contour is depicted in Figure 2.1.
Figure 2.1 Occluding boundaries under orthographic projection.
2.1.2 Theory: feature mapping and matching
Every two-dimensional (2-D) point on a plane has an equivalent dual in the 2-D line space (since
both can be defined by two numbers). Extending this to a curve Γ in 2-D, its dual Γ′ can be defined as the locus of its tangents, where each tangent corresponds to a point on the curve Γ. We use this dual to
obtain an invariant representation of the silhouette that helps us in object recognition.
For each tangent orientation, there are two parallel tangents that enclose the silhouette between
them. These tangents represent the convex hull of the object. When the silhouette is complicated
enough, there are more than two parallel tangents for any given orientation of tangents. In general,
there is an even number of tangents of any given orientation unless the orientation represents one of
the special points. For example, if we rotate an object and densely sample the images for the changing
viewpoint/pose, we will see some of these tangents move closer together or farther apart. The special points represent singular viewpoints/poses at which the points of contact of two tangents merge, causing the two tangents to merge and then disappear if the viewpoint/pose continues to change in the same direction.
These singularities are similar to the concept of aspect graphs [7,8].
When the distance between parallel tangents is divided by the distance between the outermost
tangents for this given orientation, the resulting vector of normalized distances is a scale-invariant property
of the shape. The locus of these normalized intertangent distances over different orientations is called
the signature of the silhouette. The signature can thus be parameterized by the orientation angle of the
tangents, which spans a 180° space. The points on the signature corresponding to different angles do not
have the same dimension. The dimension is given by the number of parallel tangents for that orientation
minus 2 (the ones corresponding to the convex hull). The extraction of one data point for the signature
for a given orientation of the tangents is shown in Figure 2.2.
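The extraction of one signature sample can be sketched numerically. The sketch below is a simplified realization (not the implementation of [6]): it assumes the silhouette is given as a densely sampled closed curve, detects the tangent contacts of a given orientation as local extrema of the projection onto the tangent normal, and records the positions of the interior tangents normalized by the outermost-tangent distance.

```python
import numpy as np

def signature_point(curve, theta):
    """One signature sample: normalized positions of interior tangents of
    orientation theta. `curve` is an (N, 2) array of points sampled
    densely along a closed silhouette."""
    normal = np.array([-np.sin(theta), np.cos(theta)])
    proj = curve @ normal                       # signed distance of each point
    prev, nxt = np.roll(proj, 1), np.roll(proj, -1)
    # tangent lines of this orientation touch the curve at local extrema
    extrema = proj[(proj >= prev) & (proj >= nxt) |
                   (proj <= prev) & (proj <= nxt)]
    vals = np.sort(np.unique(extrema))
    span = vals[-1] - vals[0]                   # distance between outermost tangents
    # interior tangents only; scale cancels out after normalization
    return (vals[1:-1] - vals[0]) / span if span > 0 else np.array([])
```

For a convex shape there are no interior tangents and the sample is empty; scaling the curve leaves the sample unchanged, which is the invariance the signature relies on.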
Figure 2.2 The scalars d1 and d2 defining a point on the signature are determined using distances between parallel tangents on the original closed curve corresponding to the orientation (perpendicular to) line ∆θ at angle θ to a reference orientation (such as the x-axis). These distances are normalized by dividing them by the distance between the outermost tangents, d3.
The silhouette corresponds to an actual three-dimensional (3-D) curve on the surface of the object
where the view direction grazes the surface, as shown in Figure 2.1, known as the occluding contour.
Let us call this contour C1. The set of parallel tangent lines T1^θ1 grazing the silhouette S1 for a given
orientation θ1 in a given image I1 thus corresponds to a set of parallel tangent planes L grazing the
surface of the object and containing the view direction. The orientation of lines θ1 in the image I1 also
corresponds to an orientation about the view axis in the real world. Let the view direction (in the real
world) be V1. Let the set of points where the tangent planes touch the object be P, and the set of points
on the silhouette S1 in the image I1 where T1^θ1 touches the silhouette be P1.
Now, if we view the object from another direction V2 that lies on the same set of planes L (such
that V1 × V2 is normal to the planes), the corresponding occluding contour C2 will intersect the previous
occluding contour C1 at precisely the same points P where the planes graze the surface of the object.
This is trivially true, because the new view direction lies in the same set of planes. The set of points P
is known as the frontier points for the two view directions V1 and V2. Since the relative/normalized distance
between these planes remains unchanged (as these are the same planes for the two view directions or
the two images), the corresponding set of tangent lines T2^θ2 (for some orientation θ2 in the image I2) to
the silhouette S2 in the new image I2 (which are the images of the set of planes L) will have the same
relative separation as the lines in the set T1^θ1. This holds provided that the relevant frontier points in P
are not occluded in either I1 or I2. Thus, if we normalize the distances between tangents in the set T2^θ2
by dividing by the distance between the outermost tangents in this set, we will get the same signature
point as that of the set T1^θ1, which can thus be matched. Such a match is invariant
to image scaling and aspect-ratio changes between images, and depends on the geometry
of the object. If the set of parallel tangents is sufficiently rich (and the object geometrically complex),
such a matching can serve as surface-geometry-based object recognition. For more
details, see [6].
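Since the rotation between a test view and a model view is unknown, signature matching can search over cyclic shifts of the orientation parameter. The following is a simplified stand-in for the matching in [6]; representing a signature as a list of per-orientation vectors of normalized distances, and penalizing mismatched tangent counts with a fixed cost, are illustrative assumptions.

```python
import numpy as np

def match_score(sig_a, sig_b):
    """Compare two signatures sampled on the same orientation grid by
    searching over cyclic shifts (the unknown rotation between views).
    Each signature is a list of normalized-distance vectors, one per
    sampled orientation. Lower score means a better match."""
    n = len(sig_a)

    def dist(u, v):
        if len(u) != len(v):          # differing tangent counts: fixed penalty
            return 1.0
        if not len(u):                # both empty (convex orientation): match
            return 0.0
        return float(np.abs(np.asarray(u) - np.asarray(v)).mean())

    # best average distance over all cyclic alignments of the orientations
    return min(sum(dist(sig_a[i], sig_b[(i + k) % n]) for i in range(n)) / n
               for k in range(n))
```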
2.1.3 Apparatus
A camera with a zoom lens (high focal length) was used to take images of the object in a near
weak-perspective (scaled-orthographic) setting. The object was put on a pan-tilt turntable in front of a
back-lit screen or a dull black cloth, and a sequence of images in various poses (or equivalently, various
view directions) was taken. The high contrast between the opaque object and the bright back-lit
screen or the dull black cloth simplified the image processing required to extract the silhouette of the
object. The change in the pan-tilt parameters of the turntable determined the pose of the object, or
equivalently, the view angle. Some of the object modeling shots are shown in Figure 2.3.
Figure 2.3 Some images from the object modeling sequence of the ‘Phone’ object.
2.1.4 Object modeling
Six objects were modeled using the system. Representative images of these objects are shown in
Figure 2.4. These objects were put on the turntable and the table was rotated to simulate one great circle
of the view sphere.
Figure 2.4 Representative images of the six objects used for object modeling.
The Canny edge detector was used to obtain object boundaries as linked, closed curves. To prevent
the program from becoming confused by the internal edges while extracting the silhouette, some of
the internal edges were removed by hand. The silhouette was smoothed and its dual in tangent space
was calculated. While modeling the object from various images of the object, such signatures were
calculated and integrated into an object model. The complete system diagram for the object modeling
system is shown in Figure 2.5.
Figure 2.5 Object modeling system diagram.
2.1.5 Object recognition
Test images for recognition were shot from novel viewpoints that did not coincide with the great
circle that the modeling images were shot from. The silhouette from the test image was also extracted
and smoothed. The dual of the silhouette was then extracted. In the matching phase, the signature of the dual
is matched against the signature of the stored object models. Assuming orthographic projection, even
if the test image is taken from an angle not seen earlier, the signatures will match for the same object.
The reason for this expected match is that the silhouettes are orthographic projections of 3-D curves that
correspond to the grazing of the view direction of the actual object surface, as explained in Section 2.1.2.
The diagram for the system that matches the test images to stored models is shown in Figure 2.6. More
details are given in [6].
Figure 2.6 Object matching system diagram.
2.1.6 Results and discussion
The recognition system was tested on a set of six objects. The objects are shown in Figure 2.4. The
recognition system achieved more than 90% recognition rate for this set of six objects. The results are
shown in Figure 2.7. More experiments on cluttered background were also conducted and showed some
initial success [6]. These concepts were extended to a non-orthographic setting in [9]. Later, the same
principles were also used in calculating structure from motion for smooth textureless objects [10,11].
Figure 2.7 Some test images from novel (unseen) viewpoints that were correctly recognized. The contour is shown as a thick dotted line around the object.
The success of the modeling and recognition was found to be critically dependent on accurate ex-
traction of the object silhouettes. This required an elaborate setup of clutter-free background for taking
images. From a practical system point of view, a clutter-free background is rarely available in the real
world. One has to model the clutter and take noise into consideration for any practical application.
In fact, segmentation and recognition together can be viewed as chicken-and-egg problems [12], since
the success of recognition depends on accurate segmentation, and segmentation itself can benefit from
recognition. We explore systems with feedback for chicken-and-egg problems in Chapter 3.
2.2 Multiple Object Tracking and Event Detection
Similar to object recognition, most approaches in event detection have taken the feedforward
approach from low-level to high-level feature extraction, without much use of feedback in the system.
These approaches build object tracking on background subtraction and blob extraction modules,
whose outputs serve as inputs to the tracking modules. Some of the extracted features of the
objects, along with their positions, are used to form behavior models for event detection. Typical examples of
such systems are [13–17].
Our work on multiple object tracking and event detection is based on a similar approach [18]. The
objective is to track multiple humans in an indoor environment and detect whether certain events have
taken place. The human trajectories are used to trigger specific events. The multiple object tracking
method works with fixed cameras.
2.2.1 Human detection
Human detection starts with an adaptive background modeling module which deals with changing
illuminations and does not require objects to be constantly moving. A Gaussian-mixture-based
background modeling method [19] is used to generate a binary foreground mask image. An object-detection
module takes the foreground pixels generated by background modeling as input and outputs
object-detection probabilities. It searches over the foreground pixels and gives, for each
location, the probability that an object of a certain scale is present there. Any object-detection approach can fit into this part.
In our implementation, we apply a neural-network-based object-detection module to detect pedestrians.
Each foreground blob is potentially the image of a person. A neural network trained for this task is
applied at each pixel location. The network generates a score indicative of the probability that the blob
around the pixel does in fact represent a human at some scale. A
particular part of the detected person (e.g., the approximate center of the top of the head) is illustratively
used as the location of the object, which is shown as a light spot in Figure 2.8(c); lighter spots
indicate higher detection scores. The neural network searches over each pixel at a few scales.
The detection score corresponds to the best score, i.e., the largest detection probability, among all scales.
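The multi-scale scan described above can be sketched as follows. Here `score_fn` is a stand-in for the trained neural network, and the scales and window handling are illustrative assumptions; in the real system the network sees image patches, not the binary mask used in this sketch.

```python
import numpy as np

def detection_map(mask, score_fn, scales=(16, 24, 32)):
    """Scan every foreground pixel at several window scales and keep the
    best (largest) score across scales. `mask` is the binary foreground
    mask; `score_fn(patch)` stands in for the trained detector."""
    h, w = mask.shape
    out = np.zeros((h, w))
    ys, xs = np.nonzero(mask)            # only foreground pixels are scanned
    for y, x in zip(ys, xs):
        best = 0.0
        for s in scales:
            half = s // 2
            patch = mask[max(0, y - half):y + half, max(0, x - half):x + half]
            best = max(best, score_fn(patch))
        out[y, x] = best                 # best score among all scales
    return out
```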
Figure 2.8 Results of the human detection. (a) The original frame, (b) background mask, and (c) human detection probability map.
2.2.2 Multiple human tracking algorithm
The tracking algorithm accepts the probabilities of preliminary object-detection and keeps multiple
hypotheses of object trajectories in a graph structure, as shown in Figure 2.9. Each hypothesis consists
of the number of objects and their trajectories. The first step in tracking is to extend the graph to include
the most recent object-detection results, that is, to generate multiple hypotheses about the trajectories.
An image-based likelihood is then computed to give a probability to each hypothesis. This computation
is based on the object-detection probability, appearance similarity, trajectory smoothness, and image
foreground coverage and compactness. The probabilities are calculated based on a sequence of images;
therefore, they are temporally global representations of hypothesis likelihood. The hypotheses are
ranked by their probabilities and the unlikely hypotheses are pruned from the graph in the hypotheses-
management step. In this way a limited number of hypotheses are maintained in the graph structure,
which improves the computation efficiency.
In the graph structure (Figure 2.9), the graph nodes represent the object-detection results. Each node
is composed of the object-detection probability, object size or scale, location, and appearance. Each link
in the graph is computed based on position closeness, size similarity and appearance similarity between
two nodes (detected objects). The graph is extended horizontally in time. In this section we describe
three steps of the tracking algorithm: hypotheses generation, likelihood computation and hypotheses
management.
Figure 2.9 Multiobject tracking graph structure.
Given object-detection results in each image, the hypotheses generation step first calculates the
connections between the maintained graph nodes and the new nodes from the current image. The
maintained nodes include the ending nodes of all the trajectories in maintained hypotheses. They are not
necessarily from the previous image, since object-detection may miss detections. The connection
probability p_con is computed according to Equation (2.1):
p_con = w_a × p_a + w_p × p_p + w_s × p_s (2.1)
In Equation (2.1), w_a, w_p, and w_s are the weights in the connection probability computation; that is,
the connection probability is a weighted combination of the appearance similarity probability p_a, the position
closeness probability p_p, and the size similarity probability p_s. We prune the connections whose
probabilities are very low for the sake of computational efficiency. As shown in Figure 2.9, the generation
process takes care of object occlusion by track splitting and merging. When a person emerges from
occlusion, the occluding track splits into two tracks; conversely, when a person becomes occluded, the
corresponding node is connected (merged) with the occluding node. The generation process deals with
missing data naturally by skipping nodes in graph extensions; that is, a connection is not necessarily
built on two nodes from consecutive image frames. The generation handles false detections by keeping
some of the hypotheses that exclude the nodes corresponding to the false detections. It initializes
new trajectories for some nodes depending on their (weak) connections with existing nodes and their
locations (in areas where objects appear, such as doors and view boundaries). The multiple object tracking algorithm
keeps all possible hypotheses in the graph structure. At each local step, it extends and prunes the graph
in a balanced way to maintain the hypotheses as diversified as possible and delays the decision on the most
likely hypothesis to a later step.
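A minimal sketch of the connection computation of Equation (2.1) with pruning is given below. The specific similarity measures (histogram intersection for appearance, an exponential fall-off for position distance) and the weights and threshold are illustrative assumptions, not the exact choices used in the system.

```python
import numpy as np

def connection_probs(old_nodes, new_nodes, w=(0.4, 0.4, 0.2), thresh=0.1):
    """Weighted connection probability of Eq. (2.1),
    p_con = w_a*p_a + w_p*p_p + w_s*p_s, with weak links pruned.
    Each node is (appearance_histogram, position, size)."""
    wa, wp, ws = w
    links = {}
    for i, (ha, xa, sa) in enumerate(old_nodes):
        for j, (hb, xb, sb) in enumerate(new_nodes):
            pa = np.minimum(ha, hb).sum()                      # histogram overlap
            pp = np.exp(-np.linalg.norm(np.subtract(xa, xb)) / 50.0)
            ps = min(sa, sb) / max(sa, sb)                     # size similarity
            p = wa * pa + wp * pp + ws * ps
            if p >= thresh:                                    # prune weak links
                links[(i, j)] = p
    return links
```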
The likelihood or probability of each hypothesis generated in the first step is computed according to
the connection probability, the object-detection probability, trajectory analysis, and the image likelihood
computation. The hypothesis likelihood is accumulated over image sequences, and the likelihood for frame
i is incrementally calculated based on the likelihood in image i − 1, as shown in Equation (2.2):

l_i = l_{i-1} + \frac{1}{n}\sum_{j=1}^{n}\left[\log(p_{con_j}) + \log(p_{obj_j}) + \log(p_{trj_j})\right] + L_{img} \quad (2.2)
In Equation (2.2), l_i is the likelihood in the ith frame, and n represents the number of objects in the
current hypothesis. The term p_{con_j} denotes the connection probability of the jth trajectory computed in Equation
(2.1). If the jth trajectory has a missing detection in the current frame, a small probability, i.e., the missing
probability, is assigned to p_{con_j}. The term p_{obj_j} is the object-detection probability, and p_{trj_j} measures the
smoothness of the jth trajectory. We use the average likelihood of the multiple trajectories in the computation.
The metric prefers hypotheses with better human detections, stronger similarity measurements, and
smoother tracks. L_img is the image likelihood of the hypothesis. It is composed of two terms, as shown
in Equation (2.3):
L_img = l_cov + l_comp (2.3)
The term l_cov in Equation (2.3) is given in Equation (2.4), and l_comp in Equation (2.5):

l_{cov} = \log\left(\frac{\left|A \cap \left(\bigcup_{j=1}^{n} B_j\right)\right| + c}{|A| + c}\right) \quad (2.4)

l_{comp} = \log\left(\frac{\left|A \cap \left(\bigcup_{j=1}^{n} B_j\right)\right| + c}{\sum_{j=1}^{n} |B_j| + c}\right) \quad (2.5)
In Equation (2.4), l_cov calculates the hypothesis coverage of the foreground pixels, and in
Equation (2.5), l_comp measures the hypothesis compactness. A denotes the set of foreground
pixels, and B_j represents the pixels covered by the jth node (or track). The symbol ∩ denotes set
intersection and ∪ set union. The numerators in both l_cov and l_comp represent the foreground pixels
covered by the combination of the multiple trajectories in the current hypothesis. Therefore, l_cov represents the
foreground coverage of the hypothesis (the higher the value, the larger the coverage), and l_comp measures how much
the nodes overlap with each other (the larger the value, the less the overlap and the more compact the hypothesis). The term c is a constant.
These two values give a spatially global explanation of the image (foreground) information.
The hypothesis likelihood is a value refined over time. It provides a global description of object-
detection results. Generally speaking, the hypotheses with higher likelihood are composed of better
object-detections with good image explanation. It tolerates missing and false detections since it has a
global view of image sequences.
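The incremental update of Equation (2.2) and the image likelihood of Equations (2.3)-(2.5) can be sketched directly. Representing pixel sets as Python sets and the choice of the constant c are illustrative; the probabilities are assumed already computed by the earlier steps.

```python
import math

def update_likelihood(l_prev, p_con, p_obj, p_trj, l_img):
    """Eq. (2.2): average the per-trajectory log terms over the n
    trajectories and add the image likelihood L_img."""
    n = len(p_con)
    avg = sum(math.log(c) + math.log(o) + math.log(t)
              for c, o, t in zip(p_con, p_obj, p_trj)) / n
    return l_prev + avg + l_img

def image_likelihood(A, B_list, c=1.0):
    """L_img = l_cov + l_comp (Eqs. (2.3)-(2.5)). A is the set of
    foreground pixels; B_list holds the pixel set covered by each track."""
    union = set().union(*B_list)
    covered = len(A & union)                                   # |A ∩ (∪ B_j)|
    l_cov = math.log((covered + c) / (len(A) + c))             # coverage
    l_comp = math.log((covered + c) /
                      (sum(len(B) for B in B_list) + c))       # compactness
    return l_cov + l_comp
```

Overlapping tracks inflate the denominator of l_comp, so a hypothesis whose boxes overlap is penalized relative to a compact one covering the same foreground.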
This step ranks the hypotheses according to their likelihood values. To avoid combinatorial explo-
sion in graph extension, we only keep a limited number of hypotheses and prune the graph accordingly.
The hypotheses management step deletes out-of-date tracks, which correspond to objects that have
been gone for a while, and keeps a short list of active nodes, which are the ending nodes of the trajectories
of all the kept hypotheses. The number of active nodes determines the scale of graph extension;
therefore, a careful management step ensures efficient computation. The design of this multiple
object tracking algorithm follows two principles. (1) We keep as many hypotheses as possible and make
them as diversified as possible to cover all the possible explanations of the image sequences. The top
hypothesis is chosen at a later time to guarantee that it is an informed and global decision. (2) We make local
prunes of unlikely connections and keep only a limited number of hypotheses. With reasonable settings
of these thresholds, the method achieves real-time performance in a not-too-crowded environment.
The graph structure is applied to keep multiple hypotheses and make reasonable prunes for both reliable
performance and efficient computation. The tracking module provides feedback to the object-detection
module to improve the local detection performance. According to the trajectories in the top hypothesis,
the multiple object tracking module predicts the most likely locations to detect objects. This interaction
tightly integrates the object-detection and tracking, and makes both of them more reliable.
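The hypotheses-management step can be sketched as follows; the dictionary fields, the ranking key, and the age threshold are illustrative assumptions about how hypotheses and tracks are stored.

```python
def manage(hypotheses, frame, max_keep=50, max_age=30):
    """Rank hypotheses by likelihood, keep the top max_keep, drop
    out-of-date tracks, and collect the active ending nodes that the
    next graph extension will connect to."""
    kept = sorted(hypotheses, key=lambda h: h["likelihood"],
                  reverse=True)[:max_keep]          # prune unlikely hypotheses
    active = set()
    for h in kept:
        h["tracks"] = [t for t in h["tracks"]
                       if frame - t["last_frame"] <= max_age]  # out-of-date
        active.update(t["end_node"] for t in h["tracks"])
    return kept, active
```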
2.2.3 Tracking results
The multiple object tracking method has been tested on two cameras. The first scenario includes
two persons coming in through the door at about the same time. Figure 2.10(a) shows four images from the
sequence with overlaid bounding boxes showing the human detection results. The darker the bounding box,
the higher the detection probability. Figure 2.10(b) demonstrates the multiple tracks with the largest
probability generated by the multiple object tracking. The tracks are overlaid on the detection score map.
Different intensities represent different tracks. The human detection based on each image is certainly not
perfect. In the first and third images, the human detector misses the person in the back due to occlusion
and the person in the front due to distortion, respectively. There are false detections in the fourth image
caused by background noise and interaction between people. However, the multiple object tracking method
manages to maintain the right number of tracks and their configurations, as shown in Figure 2.10(b), because
it searches for the best explanation sequence of the observations over time. Figure 2.11 demonstrates
an example of multiple people tracking with crossing tracks. In this example, a lady first opens
the door for a person in a gray shirt; a person in a dark shirt then follows and enters the area.
Figure 2.11(a) shows the images from the sequence, and (b) demonstrates the tracking result. Interestingly,
there is one short track close to the upper-left corner of the result image, because one person is standing
inside the door and the human detector consistently detects him through the glass window. Therefore,
four tracks are shown in Figure 2.11(b): the short track for the standing person, the long track for the
lady, the light track for the person in the gray shirt, and the dark track for the person in the dark shirt.
However, there are cases (not displayed) where the human detection completely fails to pick a human
for a few consecutive frames. In such a scenario, as expected, the tracking and event detection that
depend on human detection results also fail. Also, the appearance matching is based on color-histogram
matching, which sometimes causes tracks to cross over in error.
Figure 2.10 Tracking results with missing/false human detections: (a) original images with overlaid bounding boxes showing the human detection results, (b) multiple object tracking result overlaid on the human detection map.
Figure 2.11 Tracking results of crossing tracks: (a) original images with overlaid bounding boxes showing the human detection results, (b) multiple object tracking result overlaid on the human detection map.
2.2.4 Event detection
Event detection was based simply on empirically determined, hard-coded rules against which the
multiple human tracks were tested for the occurrence of events of interest. The exact nature of the events is
not important for evaluating a simplistic event detection technique such as this, where no learning
is involved at the event detection level.
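A rule of this kind can be sketched as a simple trajectory test. The region-entry rule below is illustrative only, not one of the actual events used; trajectories are assumed to be lists of (x, y) positions.

```python
def entered_region(track, region):
    """Hard-coded rule sketch: an 'entry' event fires when a trajectory
    moves from outside a region of interest to inside it.
    `region` is (xmin, ymin, xmax, ymax)."""
    def inside(p):
        x, y = p
        xmin, ymin, xmax, ymax = region
        return xmin <= x <= xmax and ymin <= y <= ymax
    # look for an outside-to-inside transition between consecutive positions
    return any(not inside(a) and inside(b)
               for a, b in zip(track, track[1:]))
```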
2.3 Discussion
Several observations are in order. Feedforward approaches based on feature detection are the
popular choice in the research community for assembling complex vision systems. Attempts to
improve performance aim at improving the individual performance of the modules in the feedforward
hierarchy. A considerable amount of research has gone into designing features invariant to
transforms that are irrelevant to the task at hand [3, 20, 21]. Success largely depends on the
individual performance of modules in the hierarchy chain. There is information loss whenever a feature
extractor maps an image to a lower-dimensional space. It is not always obvious that the information
extracted by the feature detectors is all the information needed by the next level.
As discussed in the cases above, the feedforward systems depend on good performance of a few
critical modules. For example, the object recognition system in Section 2.1 critically depends on
successful extraction of the object silhouette, and the event detection system in Section 2.2 depends on the
human detection, color-histogram extraction, and interframe object matching criteria. Most interesting
real-world scenarios encountered are far too complicated, cluttered, and novel to depend on one or two
specialized modules. Moreover, there is no principled mechanism in feedforward and discriminative
design to make use of feedback from the high-level modules to help in learning or inference at the lower
levels or among the various high-level modules themselves. There is no use of synergetic inference and
learning between modules either.
In the area of multiple object tracking, there is no database against which the research community
can benchmark. Such databases exist for the face recognition community, such as the Yale Face
Database [22] and the FERET Face Database [23]. Moreover, most of the research is driven by target
applications and scenarios. Unlike well-segmented faces under different pose and lighting conditions, for
example, it is difficult to lay down specific scenarios for multiobject tracking. The nature of the
occlusion and changing lighting conditions is unique to every scenario.
For the task of event detection, specifying the scenario becomes even more difficult. It seems that
tight coupling between the various modules and levels of the feature hierarchy is the key to producing
more generalized systems. Chapter 3 deals with this issue further.
CHAPTER 3
VISION SYSTEMS WITH FEEDBACK AND GENERATIVE MODELS
Many subproblems tackled by a vision system are interrelated. For example, in a generic image,
segmentation and recognition of an object depend on each other, since the position, scale, pose, and
configuration of the object cannot be known beforehand. Vision systems that use correct detection and
segmentation as a precondition to recognition, pose estimation, etc., have to depend on the accuracy of
detection and segmentation results. On the other hand, a recognition system can provide feedback about
the accuracy of different segmentation results, thereby narrowing the search space of recognition based
on partial detection and segmentation. This can help detection and segmentation in certain cases of
ambiguity. Such ideas have been expressed before in [24, 25]. Similarly, in hierarchical representations
of visual data, feedback from higher levels can not only help inference in the face of ambiguity and
noise at lower levels (and in turn help refine the inference at higher levels), it can also help the learning
at lower levels tune feature detection parameters so that the results of higher-level processing can be
improved.
3.1 Related Work
Previous work on systems with explicit or implicit feedback between modules that handle different
tasks can be classified based on their design. The designs differ in the basic unit module that makes
up the system as well as in the way these modules interact with each other. While there has been work
done on interconnection of linear and simple nonlinear modules, we are not aware of any work on
interconnection of complex modules. Probabilistic graphical models use the interconnected graphical
structure, but they do not have processing modules per se.
3.1.1 Connectionist models
Connectionist models include neural-network-type models in which various vision tasks such as
feature detection, segmentation, and recognition are solved implicitly and jointly in a network of very
simple and often similar processing units. Inspired by an interconnection-of-neurons model of the human
brain, these models have no clear-cut modules that perform subtasks separately or explicitly. Examples include
convolutional neural networks [26], which perform digit localization and recognition.
Similarly, the Cresceptron is used for joint segmentation and recognition of objects from images [12].
The system is trained by presenting exemplars of segmented objects, which triggers the creation of a
connectionist network modeling the shape and appearance of the object. During the testing phase, the
network solves segmentation and recognition jointly.
There are clear advantages to using these methods, as shown by the results in the related
publications [12, 26]. Chief among them are competitive accuracy and the cosolving of two or more
interdependent tasks. However, it is usually computationally and memory intensive to train and process data
using these models. Moreover, it is not clear how to extend these models to more complicated tasks,
including tasks involving temporal sequences, such as tracking and event detection. Also, for most
tractable models, the feedback is limited to training and does not extend to inference.
3.1.2 Information-theoretic models
Information-theoretic models are based on principles of information compression and transmission,
that is, on information-theoretic concepts such as the mutual or relative entropy of different variables. In
layered information-theoretic models, higher-level layers predict the output of the lower-level layers,
and lower-level layers pass the prediction error back to the higher layer to help adjust its parameters
for prediction. Ideas that represent explicit hierarchical representations and feedback in the hierarchy
include [27] and [28].
In [27], an image is coded using a hierarchy of linear predictors in a cascade array, where the
receptive field of a linear predictor maintains spatial contiguity. A predictor at a given layer tries to
predict the output of the layer below, and the layer below passes only the error in prediction to the layer
above. Such a predictive mechanism is claimed to be a part of the human visual cortex. However, linear
prediction and the architecture presented in this work are restrictive, and their application to problems more
complex than learning local image structures is not clear.
The work presented in [28], however, takes a probabilistic approach within a predictive hierarchy
similar to that of [27]. The idea is to pass information down the hierarchy in the form of probabilistic
priors. It is related to an intuitively elegant theory, called pattern theory, of how visual patterns are generated
and how learning and inference for vision can be related to information theory. Some of the other
representative publications from the pattern theory group are [29–31]. However, the application of this
theory to more complex vision problems is yet to be seen.
Machine learning, when applied to vision, can be linked to information theory in the sense that the
higher-level features are a low-dimensional, compressed version of the low-level raw-pixel information
from which we are trying to draw inferences.
3.1.3 Generative models
A class of models entirely different from discriminative models are generative models, which
model the generative process, or the hypotheses, that gives rise to the observed data. The learning task
consists of estimating the free parameters of the generative model. The cost function optimized during
learning is usually based on maximizing the likelihood of the observed data. Since the observed data are
all that we are sure of, maximizing their probability (in the probability space they define)
makes sense if we want to make complete use of the observed data. A noise model is
often assumed in order to prevent overfitting of the model to the data.
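As a minimal illustration of likelihood-based learning in a generative model, consider fitting the mean of a Gaussian with known unit noise to observed data; maximizing the log-likelihood reduces to the sample mean. The data here are synthetic and purely illustrative.

```python
import numpy as np

# Toy generative model: data assumed drawn from a Gaussian with unknown
# mean mu and known unit noise. Maximizing sum_i log N(x_i; mu, 1)
# over mu yields the sample mean.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)  # synthetic observations
mu_ml = data.mean()                               # maximum-likelihood estimate
```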
The assumption behind these models is that we know (or at least have an idea of) the process that
generated the observed visual data. Feedback and interaction between various constituents is implicitly
designed in the generative process itself. For example, we know that certain scenes are generated due
to interaction of objects. For simplicity, this interaction can be modeled as an interaction between
layers [32]. The layers in front occlude the layers at the back. The features due to the appearance of the
individual layers and due to the interaction and occlusion between layers can be treated in a unified
manner if we try to generate the whole scene using the layered model. For this, we need to infer the
hidden variables and learn the parameters that define the layers and match against the whole observed
data to see if it can be generated with a high likelihood. As far as videos are concerned, usually the
difference between variables and parameters is that the parameters remain fixed (or change slowly) over
time, while variables change from one frame to another.
3.1.3.1 Probabilistic graphical models
The mathematical property of these models that makes them suitable for graphical representation
is the factorization property. The probability distribution defined over all the random variables in the model
rarely has all variables dependent on all the other variables. Most variables depend on only a small
set of other variables. Thus, a large number of pairs of variables in the model are mutually independent
given the other variables. This reduces the n-dimensional probability distribution function of n variables to
a product of a number of simpler factors, where each factor involves only a subset of the n variables.
A factor graph represents the factorization of a function graphically [33]. It is a bipartite graph,
which means it has two sets of nodes, with each node connected by edges only to some nodes in the
other set that it is not a part of. Thus, if the edges that connect nodes from one set to the other are
removed, the graph is left with no edges. The nodes in the first set are called the variable nodes; each
such node represents one of the variables of the overall function. The nodes in the other set are called
the function nodes; each function node represents one of the factor functions that must be multiplied to obtain
the overall (often probability) function of all the variables. A function node is connected to only those
variable nodes that represent the variables that are the arguments of the factor function represented by
the function node.
Other graphical models such as Bayesian networks and Markov random fields can be converted to
factor graphs. Hence, we shall deal with the most general of these models, the factor graph. One of the
main advantages of factor graphs is a local inference algorithm called the sum-product algorithm [33].
Pearl's belief propagation algorithm for Bayesian networks can be viewed as a special case of the sum-
product algorithm. Learning in factor graphs can be formulated in a number of ways.
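As an illustration of how the factorization enables local inference, the following sketch (not from the thesis; the factor potentials are arbitrary, illustrative values) runs the sum-product messages on a tiny chain-structured factor graph and checks the resulting marginal against brute-force enumeration over the full joint.

```python
# Toy chain factor graph: p(a, b, c) ∝ f1(a, b) * f2(b, c), each variable binary.
# The sum-product algorithm computes the marginal of b from local messages;
# we verify it against brute-force summation over the full joint.
import numpy as np

f1 = np.array([[0.9, 0.1],   # f1[a, b]
               [0.2, 0.8]])
f2 = np.array([[0.7, 0.3],   # f2[b, c]
               [0.4, 0.6]])

# Message from factor f1 to variable b: sum over a of f1(a, b).
msg_f1_to_b = f1.sum(axis=0)
# Message from factor f2 to variable b: sum over c of f2(b, c).
msg_f2_to_b = f2.sum(axis=1)

# Marginal of b is the normalized product of the incoming messages.
p_b = msg_f1_to_b * msg_f2_to_b
p_b /= p_b.sum()

# Brute force: build the full joint and sum over a and c.
joint = np.einsum('ab,bc->abc', f1, f2)
p_b_brute = joint.sum(axis=(0, 2))
p_b_brute /= p_b_brute.sum()

print(p_b, p_b_brute)  # the two marginals agree
```

The point of the sketch is that the sum-product result never touches the full joint table; each message involves only the variables local to one factor.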
Another related model that generalizes over factor graphs is the product of experts [34]. A useful
approximate learning algorithm based on contrastive divergence has been suggested for it [34]. However, a
local inference mechanism such as the sum-product algorithm has not been devised for it.
The research most relevant to the applications that we are interested in (tracking, separation
into layers) [2, 32] uses the EM algorithm [35]. Although an online version of the tracking algorithm was
also published [36], it is still based on very limiting assumptions about the change of appearance, and is
computationally very expensive. It is not clear how it could be used in a real-world online system that
processes a video in real time while producing useful tracking results for multiple targets.
3.1.3.2 Models related to pattern theory and generative modeling
Models related to pattern theory make use of information theoretic insight to come up with generative
models of the data. One of the successful applications of pattern theory and generative models
is in texture recognition [37]. Approximate inference techniques such as Markov chain Monte Carlo
(MCMC) [38] and the jump-diffusion process are used to infer the hidden variables in the generative model
to fit the observed data. Other successful applications of pattern theory and generative models have been
in image parsing [39] and human hair modeling [40].
However, these models suffer from drawbacks similar to those of other generative models (such as
probabilistic graphical models), namely slow inference and learning. This is of even greater concern in
the applications that have come out of pattern theory, since they use slow statistical sampling methods
such as MCMC. Thus, although the results show promising possibilities, the applications have been
limited to static image understanding tasks and have not scaled to video processing.
3.1.4 Comparison of connectionist, information-theoretic, and generative models
With a simple argument one can show that an accurate generative model is the lowest-dimensional
representation of the data. By an accurate generative model, we mean that the data were indeed generated as
described by the model's generative process, and that the assumption about the unpredictable noise
in the generative process is also accurate. Let us assume that a generative model with n independent
parameters was used to generate a data set. Any other learnt description of the data set with m independent
parameters will fail to account for the variation due to at least one of the parameters if m < n, due
to the independence assumption.
Generative models share certain characteristics with information theoretic models. First, while generative
models represent the data efficiently by using parameters and hidden variables of smaller
dimension than the observed variable space, information theoretic models equivalently
find an efficient representation or coding of the data through information theoretic means. Second, although
generative models are explicitly designed such that one can sample hidden variables from them
to generate a plausible observed data point, with some imagination and modification such sampling
and generation of plausible data points is also possible using information theoretic models, even
though they are not explicitly designed for such a sampling process.
On the other hand, while probabilistic graphical models and connectionist models share the graph
structure, the similarity ends there. Connectionist models have no explicit criteria that emphasize the
model's ability to generate (or sample) the observed data. However, there is some loose
connection between the two models when it comes to learning model parameters by making use of the
graphical structure of the model. Methods such as gradient backpropagation [26] in connectionist models
are similar to some of the gradient-based learning methods that piggyback on local message-passing
algorithms in probabilistic graphical models [41, 42].
3.2 Application: Multimodal Person Tracking
Feedback can also be built in an ad hoc fashion, leading to boosting-type methods. In surveillance,
target tracking and signal enhancement in sound are two of the important tasks. These two problems
can be solved jointly and synergistically. The spatial motion of a moving target can be followed using
video data captured by a camera. If the object emits sound (e.g., a person speaking), audio data can be
used to estimate the time delay of arrival of sound between two (or more) microphones and thus used for
tracking. Tracking using audio is robust to occlusions and variations in lighting, whereas tracking using
video alone gives us both x- and y-coordinates. This point is demonstrated in Figure 3.1, where the visual
modality loses track of the region of interest (ROI) due to occlusion. The occlusion is simulated as a
column of random noise pixels. Thus, it is intuitively obvious that these modalities should complement
each other, and when used together should provide a more robust system with collective capabilities that
are more than the sum of its parts. The audio and video signals are correlated at various levels; lip movement
of the speaker is correlated with the amplitude of part of the signal and can also help us narrow down the
ROI to the sound-generating source. Also, the time delay of arrival (TDOA) between the two microphones
is correlated with the position of the speaker in the image. It has been shown that humans use TDOA
as a main cue for source localization [43]. We also exploit TDOA for the audio-based estimate of the
person's position using two microphones. When visual tracking fails due to occlusion, instability of the
tracker, or corruption of frames by random noise, the audio modality can be used to reinitialize the visual
tracker. On the other hand, when visual tracking is robust, the estimate of the position of the object can
be used to get an estimate of the time delay of the component of the sound coming from the target at the
microphone pair, thus helping in source separation and noise cancelation.
Figure 3.1 Top row shows tracking performance using video alone in presence of occlusion (random noise strip). Note that in the rightmost frame, the tracker is unable to follow the subject. Bottom row shows tracking performance using both audio and video. Now the target is being followed after occlusion.
We consider a surveillance system with audio and video subsystems. The system blocks are shown in
Figure 3.2. These subsystems can be used together either by viewing the problem as a feature integration
problem, or the subsystems can interact among themselves to give a better solution than what either
one of them can generate individually. We demonstrate that by using audio alongside video our system
is robust to occlusion. It also performs robustly when some frames are totally corrupted by noise.
Audio is also used for automatic initialization of visual tracking. Results of the video tracking are used
to estimate the time delay for the audio signal generated by the target in a robust manner. This delay is
further used to separate the target audio signal from the background noise. To our knowledge, neither
problem has previously been solved using a multimodal approach.
Figure 3.2 System diagram.
There has been substantial work done in tracking moving objects using video [32, 36, 44]. Tracking
people using microphone arrays has also been done [45]. However, the problem of using these two
modalities together is relatively new and growing fast. In [46] the problem of speaker detection is
addressed by using a time-delayed neural network to learn the correlation between a single microphone channel
and an image in a video sequence. Cutler and Davis [46] and Garg et al. [47] also address a similar
problem by using multiple audio and video features, such as skin color, lip motion, etc., in a probabilistic
approach. The particle filtering approach of [44] was extended by Vermaak et al. [48] to include audio
for tracking by modeling cross-correlations between signals from the microphones in an array. The
tracking algorithm in [49] extends this approach using an unscented particle filter. Another approach
using graphical models was used by Beal et al. [50] for audiovisual object tracking. However, none
of these works deal with occlusion of the target or corruption of some frames by random noise. The
problem of audio source separation using visual tracking has also not been addressed.
3.2.1 Algorithm
We start by developing the audio and video subsystems of the surveillance system independently.
Then we combine the two subsystems to deal with the problems of visual tracking: initialization, occlusion,
and frames corrupted by noise. We also solve the problem of source separation and noise cancelation in
audio using the result of visual tracking.
We intentionally chose simple algorithms for each subsystem which, when combined across modalities,
do a better job than other sophisticated algorithms. We reiterate that the focus of this work
is on combining these modalities in a synergistic manner.
3.2.1.1 Time delay of arrival estimation using audio signals
As shown in Figure 3.3, let us assume that the target object T moves in the x-direction in a plane
parallel to the image plane O′I of an ideal pinhole camera with focal point O, where O′ is the location
where the optical axis meets the image plane, and I is the image of the target object. Thus, OO′
represents the optical axis and O′I represents the image plane. Let T′ be the projection of T on the
focal plane. Thus, z = TT′ is the distance of the plane of motion of the object from the focal plane
of the camera. Let the microphones M1 and M2 be placed at a distance d = OM1 = OM2 each on
either side of the focal point O of the camera. Let the distance of the object from the optical axis be
x = OT′. The angle φ1 = ∠TM1T′ follows the relation given in Equation (3.1) in triangle TT′M1.
tan φ1 = z / (x + d) = z / m(x)    (3.1)

where m(x) is a linear function of the position of the object in the x-direction in the image (x = O′I), which maps the image position to x + d = T′M1. Assuming φ1 ≈ φ2 = ∠TM2T′, we also get Equation (3.2):
cos φ1 = Δ / (2d)    (3.2)
Using Equations (3.1) and (3.2) we get Equation (3.3):

δ = Δ / c = (2d / c) cos(cot⁻¹(l(x)))    (3.3)
where l(x) is a linear function of x, δ is the time delay between the sound signals that arrive at the
microphones M1 and M2, Δ is the approximate extra distance that the sound has to travel to reach M1
when compared to M2, and c is the speed of sound.
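The geometry above can be checked numerically. The sketch below uses illustrative values for d, z, x, and the speed of sound (none are from the thesis); it computes the approximate delay of Equations (3.2) and (3.3) and compares it with the exact path difference between the two microphones.

```python
# Numerical check of the far-field approximation behind Equations (3.1)-(3.3).
# All values are illustrative, not from the thesis.
import math

c = 343.0   # speed of sound (m/s)
d = 0.05    # microphone offset from the focal point O (m)
z = 3.0     # distance of the plane of motion from the focal plane (m)
x = 1.0     # object offset from the optical axis, x = OT' (m)

# Equation (3.1): tan(phi1) = z / (x + d) in triangle T T' M1.
phi1 = math.atan2(z, x + d)

# Equations (3.2)-(3.3): approximate extra path length and time delay.
extra_path = 2 * d * math.cos(phi1)
delay = extra_path / c

# Exact path difference for comparison (M1 at -d, M2 at +d on the x-axis).
exact = (math.hypot(z, x + d) - math.hypot(z, x - d)) / c

print(delay, exact)  # the approximation is close for small d
```

For a microphone spacing that is small relative to the target range, the approximation error stays within a few percent, which is adequate since the calibration step absorbs systematic deviations.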
Figure 3.3 Geometry of time delay.
In the real world, however, some of the assumptions (such as the object moving exactly parallel
to the image plane) will not hold. Therefore, the mapping between x (the position of the
object in the x-direction in the image) and δ is estimated using a calibration process.
Our framework requires only 15 frames for learning the mapping from the time delay δ to the x-coordinate
of the object. This process is called calibration of the audio, and it is a one-time,
offline training process. After calibration, the audio subsystem can pass an estimate of the object
location to the video subsystem for every video frame.
The delay δ is estimated as follows. We consider windowed audio frames of N samples with 75%
overlap. It should be noted that several audio frames make up one video frame in terms of time. We
used cross-correlation as a coherence measure to determine the delay between the two microphones:

Rij(τ) = Σ_{n=0}^{N−1} xi[n] xj[n − τ]    (3.4)
where xi[n] is the discrete sampled signal received by microphone i, and τ is the TDOA between the two
received signals. In our case we had two microphones. The cross-correlation, and hence Rij(τ), is maximal
when τ equals the offset between the two signals. The complexity of computing
Rij(τ) using Equation (3.4) is O(N²). This can be approximated by computing the inverse Fourier
transform of the cross-spectrum as given by:

Rij(τ) ≈ Σ_{k=0}^{N−1} Xi(k) Xj(k)* e^{j2πkτ/N}    (3.5)
We have developed this algorithm closely along the lines of Knapp and Carter [51]. Also note that
δ follows the relation in Equation (3.6):

δ = τ / fs    (3.6)

where fs is the sampling rate of the sound signal. In practice, the calibration process fixes the mapping
between τ (which is estimated for each visual frame) and x.
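A minimal sketch of the delay estimation in Equations (3.4)–(3.6): the cross-correlation of the two microphone signals is computed through the cross-spectrum as in Equation (3.5), and the peak lag gives τ, which Equation (3.6) converts to seconds. The signal, delay, and frame length below are synthetic, for illustration only.

```python
# Estimate a known synthetic delay via the inverse FFT of the cross-spectrum.
import numpy as np

rng = np.random.default_rng(0)
fs = 44100          # sampling rate (Hz)
N = 2048            # samples per audio frame
true_tau = 25       # microphone 1 receives the signal 25 samples late

x2 = rng.standard_normal(N)   # signal at microphone 2
x1 = np.roll(x2, true_tau)    # circularly delayed copy at microphone 1

# Equation (3.5): cross-correlation via the inverse FFT of X1(k) X2(k)*.
R = np.fft.ifft(np.fft.fft(x1) * np.conj(np.fft.fft(x2))).real

# The TDOA estimate is the lag maximizing R, mapped to [-N/2, N/2).
tau_hat = int(np.argmax(R))
if tau_hat >= N // 2:
    tau_hat -= N

delta = tau_hat / fs          # Equation (3.6): delay in seconds
print(tau_hat, delta)
```

The FFT route reduces the O(N²) correlation of Equation (3.4) to O(N log N), which matters when many overlapping audio frames are processed per video frame.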
The mean of thex’s collected for all the audio frames corresponding to a video frame gives the
audio estimate of the position of the object inx-direction at timet: xaudiot . The inverse of the standard
deviation of thex’s collected for all the audio frames corresponding to a video frame gives the confi-
dence in audio-based estimate of the positionAudConf , which is used in combining the two modalities
to improve object tracking as described later.
3.2.1.2 Visual tracker
Once initialized by hand or automatically (by audio-based estimation, as shown later), the visual
tracker maximizes the Bhattacharyya coefficient [52] between the histogram of features extracted from
the target region in the previous frame and that extracted from the potential regions in the current frame.
The feature that we use is the grayscale intensity frequency distribution in the regions. Our visual tracker
is a simplified version of the mean shift tracker [53], which finds the rectangular window in the current frame
that is a translated version of the rectangular window from the previous frame. The matching finds an
exact solution by searching over one-pixel shifts in a region around the expected position of the window
in the current frame, and picking the window that gives the maximum Bhattacharyya coefficient with the
window from the previous frame. This is described in Equation (3.7):

x_t^video = arg max_{x ∈ N(x_{t−1})} Σ_{k=1}^{n} √( h_t^k(x) h_{t−1}^k(x_{t−1}) )    (3.7)
In Equation (3.7), x_t^video is the position of the window in frame t (the current frame), h_t^k(x) is
the histogram of the kth feature in the tth frame for the window around position x, and N(x_{t−1}) is the
neighborhood around x_{t−1}, defined as a rectangular region around x_{t−1} plus a fraction of the
previous motion vector if the previous motion vector is trustworthy, as determined by the maximum
Bhattacharyya coefficient. To our knowledge, such use of the Bhattacharyya coefficient to determine the
confidence in visual tracking has not been explored before. This gives a simple tracking strategy that
reduces the search space for the new window by predicting its position in the next frame.
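The window search of Equation (3.7) can be sketched as follows. The window size, search radius, histogram bin count, and synthetic frames are illustrative assumptions, not values from the thesis, and the motion-vector prediction is omitted for brevity.

```python
# Minimal Bhattacharyya-coefficient window search over one-pixel shifts.
import numpy as np

rng = np.random.default_rng(1)

def histogram(frame, pos, size=16, bins=16):
    """Normalized grayscale histogram of a size x size window at pos."""
    y, x = pos
    patch = frame[y:y + size, x:x + size]
    h, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return h / h.sum()

def bhattacharyya(h1, h2):
    return float(np.sum(np.sqrt(h1 * h2)))

def track(prev_frame, cur_frame, prev_pos, radius=5):
    """Window position maximizing the Bhattacharyya coefficient with
    the previous frame's target histogram (Equation (3.7))."""
    h_prev = histogram(prev_frame, prev_pos)
    best_pos, best_bc = prev_pos, -1.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            p = (prev_pos[0] + dy, prev_pos[1] + dx)
            bc = bhattacharyya(histogram(cur_frame, p), h_prev)
            if bc > best_bc:
                best_pos, best_bc = p, bc
    return best_pos, best_bc

# Synthetic test: a bright textured patch on a dark background,
# translated by (2, 3) pixels between frames.
frame1 = rng.integers(0, 40, size=(100, 100)).astype(float)
frame2 = rng.integers(0, 40, size=(100, 100)).astype(float)
patch = rng.integers(180, 255, size=(16, 16)).astype(float)
frame1[40:56, 40:56] = patch
frame2[42:58, 43:59] = patch

pos, bc_max = track(frame1, frame2, (40, 40))
print(pos, bc_max)  # expected near (42, 43) with bc_max close to 1
```

Note that bc_max doubles as a match-quality score, which is exactly what the failure test in the next subsection exploits.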
3.2.1.3 Multimodal object tracking
For initializing the visual tracker in the first frame, the position of the target as determined by the
audio subsystem is used. After this, the visual tracker performs two-frame tracking as described above.
For cases where visual tracking fails, a criterion to detect the failure was devised. In such
a scenario, the estimate of the position of the target determined by audio was used to reinitialize the
tracker. The estimate of the target position was also fed back to the audio subsystem to estimate the delay
associated with the sound component from the target, to perform noise cancelation.
Failure of the visual tracker was determined as follows. If the tracker loses the object due to drift or
occlusion, it settles on the background. When this happens, the tracking window stops moving and
settles on a constant window that does not change. This also results in the maximum Bhattacharyya
coefficient becoming close to one. Thus, when these two conditions occur simultaneously over consecutive
frames, it is an indication of tracker failure. When frames become totally corrupted, or when the
tracker suddenly loses track of the object, the maximum Bhattacharyya coefficient approaches zero.
This was the second criterion to indicate failure of visual tracking. The third way to determine failure was
the indication of a highly confident estimate of target position from the audio subsystem that was far
away from the visual tracking-based estimate. This is summarized in Equation (3.8):

VisFail = TRUE,  if bcMax ≥ θ1 AND |x_t^video − x_{t−1}| = 0;
          TRUE,  else if bcMax ≤ θ2;
          TRUE,  else if AudConf ≥ θ3;
          FALSE, otherwise.    (3.8)
where VisFail is the Boolean flag that tells whether visual tracking has failed, bcMax is the
maximum Bhattacharyya coefficient of matching the window from the previous frame with windows in the
current frame, x_t is the position of the window in the tth frame, AudConf is the confidence of the audio
subsystem in its prediction of the position of the target, and θ1, θ2, and θ3 are empirically determined
thresholds. The position x_t is set to x_t^audio (the estimate of the position of the target as determined
by the audio subsystem) when VisFail is TRUE, and to x_t^video when VisFail is FALSE, to give a
robust estimate of x_t.
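Equation (3.8) and the subsequent selection rule translate directly into code. The threshold values below are placeholders, since the thesis determines θ1, θ2, and θ3 empirically.

```python
# Failure test of Equation (3.8) and the audio/video fusion that follows it.
THETA1, THETA2, THETA3 = 0.95, 0.2, 0.8   # illustrative thresholds

def vis_fail(bc_max, x_video_t, x_prev, aud_conf):
    """Equation (3.8): has the visual tracker failed?"""
    if bc_max >= THETA1 and x_video_t == x_prev:
        return True          # window frozen on background
    if bc_max <= THETA2:
        return True          # corrupted frame or sudden loss of track
    if aud_conf >= THETA3:
        return True          # highly confident audio estimate overrides video
    return False

def fuse(bc_max, x_video_t, x_audio_t, x_prev, aud_conf):
    """Robust position estimate: audio if visual tracking failed."""
    if vis_fail(bc_max, x_video_t, x_prev, aud_conf):
        return x_audio_t
    return x_video_t

# Normal tracking: the video estimate is used.
print(fuse(0.7, 110, 95, 100, 0.3))
# Frozen window with a near-perfect match: audio takes over.
print(fuse(0.99, 100, 95, 100, 0.3))
```

The rule is deliberately cheap to evaluate per frame; all the cost sits in computing bcMax and AudConf, which the subsystems already produce.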
3.2.2 Results
We tested the algorithm on several test cases and found that multimodal object tracking improved
the performance over a purely visual tracker. In the cases where the visual tracker failed, the causes
included the tracker drifting onto a matching background, a change in the target's appearance, occlusion of
the target, and corruption of frames with random noise. We also tested the algorithm for noise cancelation
for source separation of the sound coming from the source of interest. Target motion was mostly
horizontal and translational without any significant movement in the y-direction. However, there was
significant change in the appearance of the object due to rotation and articulation. The algorithm was
consistently able to track the speaking target in the presence of background noise, occlusion, or even
when frames were corrupted with noise.
The video capture rate was 15 frames per second, while audio was digitized at 44,100 Hz. Thus,
we have 2940 audio samples for each video frame. In Figure 3.4, the horizontal direction represents the
time along the sequence.

Figure 3.4 The x-coordinate of the output of the audio, video, and combined trackers for occlusion.

Since the delay locations τ had to be mapped to image locations, 10 manually annotated
frames were used only once, and thereafter only raw data were given to the algorithm. No model
parameters were set by hand and no initialization was required.
We present the results on two sequences which had occlusion and/or dropped frames. Audio
waveforms were consistently corrupted with background noise. Occlusions were simulated by making
an occluding bar of pixels using random noise. When the visual tracker reaches the portion where occlusion
occurs, it loses the target and is unable to locate it again, as indicated by the constant estimate of the
x-coordinate. When we added the audio stream, which is not affected by the occlusion, the tracking
performance improved, as seen in Figure 3.4, where the tracker now uses the estimates given by the audio
stream and is thus able to locate the target.
Frame corruptions were simulated by replacing entire intermittent frames with random noise
pixels. The improvement in performance from having an additional audio modality is demonstrated in
Figures 3.5 and 3.6. In both these cases the visual tracker lost the target, but was able to follow the
target consistently with the estimates from audio. The work has been accepted for publication and will
appear as [54]. We are omitting the results on noise cancelation here.
Figure 3.5 Top row shows tracking performance using video alone when frames are dropped due to corruption by random noise. In the frames after the noisy frame, the tracker is unable to follow the subject. Bottom row shows tracking performance using both audio and video. Now the target is being followed even after frame drops due to noise.
Figure 3.6 The x-coordinate of the output of the audio, video, and combined trackers for noisy frames.
3.3 Discussion
In the human tracking system developed with an ad hoc feedback mechanism between the subsystems
dealing with the audio and video modalities, we saw a definitive improvement in results. A
more principled way to integrate different modalities and build a tighter feedback mechanism into the
system would be to use a generative model. The performance of a completely generative model is limited
by the designer's understanding of the generative process and the ability to convert that understanding
into a generative model: the functions of a factor graph, for example. Moreover, for complex systems it
may be very tricky to formulate algorithms such as the sum-product algorithm for inference and
the EM algorithm for learning. Approximations may be needed for the marginalization and other integration
procedures involved in the sum-product algorithm and the EM algorithm. On the other hand,
some of the specialized modules developed for feedforward systems perform this marginalization for
their output variables in some sense. With some modification, these can inspire new forms of function
approximations for factor graphs. These ideas are explored further in Chapter 4 to inspire the design of
the new framework described in Chapter 5 for designing complex vision systems.
CHAPTER 4
BACKGROUND FOR VARIABLE/MODULE GRAPHS
In Chapters 2 and 3, we saw that there are two types of system design approaches used in the research
community for solving complex vision problems. In the more widely used approach of feedforward
modular design, systems are designed as an interconnection of modules with one-way flow of information
from one module to another. The desired high-level task is done by a module at the end of the
processing chain. In the second approach, there is implicit or explicit feedback between various modules
to solve related tasks, treating them as interrelated subproblems. The mathematical theory for the generative
modeling approach is better understood than that for ad hoc feedback approaches and
connectionist models, since generative models model the probability distribution function (PDF) of
the combined set of observed and hidden variables. In this chapter, we develop a hybrid approach, called
variable/module graphs (or V/M graphs) [55], by combining the two approaches in order to benefit from
the advantages of both.
4.1 Differences between Feedforward Modular and Generative Design
Design of a feedforward modular vision system [6, 18, 56–58] usually follows a familiar path. One
starts by identifying the end variables (or high-level features) that need to be estimated for each observed
data point. The image can be treated as a high-dimensional vector, which serves as the data point. For
a video, images or frames form a temporal sequence. A video clip can also be treated as a single
data point by concatenating its frames. Then one identifies some intermediate variables (or low-level
features) that can be extracted and can possibly help in the detection of the desired high-level features.
If the low-level features can be extracted satisfactorily according to the context and the scenario, it
becomes easier to extract the high-level features. However, many of the low-level features are very
difficult to detect due to phenomena such as occlusion, lighting, and shadows. Thus, one tries to design
modules that take the observed data (raw pixels) as input and produce low-level features as output in a
manner that is robust to some of these phenomena. Similarly, one designs modules that take low-level
features as inputs and produce high-level features as outputs as robustly as possible. There may be other
tiers in between, depending on the designer's imagination and the needs of the task at hand.
On the other hand, while designing a generative model for video processing [32, 36, 37, 39], one
hypothesizes a generative process of how the high-level variables lead to the generation of the observed
variables. Any intermediate or end variable that is not observed directly is called a hidden variable. The
emphasis of the design is not on individual modules or feature extraction but on the generative process as a
whole. The interaction between variables is coded in the form of the conditional independence of each variable
with respect to other variables. When a subset of variables are directly dependent on each other, this
relation is coded as a joint density function of those variables.
The differences between the two approaches are tabulated in Table 4.1.
4.2 A Unifying View
With some insight and imagination, one can see that there are common points between the two ap-
proaches. First, one needs to identify the hidden variables associated with the observed variables. Some
of these hidden variables form the end variables, whose values we are interested in computing. If we
consider a joint space spanned by observed and hidden variables, it is obvious that not all combinations
of different values (or not all points in this joint space) are equally likely. For example, if the position
of a person in a frame is such that it covers the pixel at position(x, y), then the intensity and color
of the pixel is more likely to be drawn from the appearance of the person than from that of the back-
ground. Such quantification can be coded as a hard or a soft constraint between the values of different
variables. Although it is done differently in the two approaches, identification of hidden variables and
quantification of constraints between variables is the two points that are common between them.
Table 4.1 Differences between modular and generative design.

Design
  Modular: Identify end-variables. Identify features that can or need to be extracted. Identify features of features.
  Generative: Hypothesize a generative process and dependence structure. Graphically represent the structure. Approximate inference and learning.

Diagnostics
  Modular: Improve individual modules. Replace modules with better ones.
  Generative: Review the generative process and dependence structure. Review approximations.

Advantages
  Modular: Straightforward, discriminative design. Bottom-up, speedy inference. Fewer parameters; faster learning. Complex joint-PDF models for specific features.
  Generative: Natural and intuitive representation of the generative anatomy. Makes principled and complete use of the information available. Local message passing for inference. EM algorithm for holistic learning.

Shortcomings
  Modular: May not use all information because of feature extraction. No feedback among modules. Depends on the individual performance of modules. May not be representative of the generative anatomy.
  Generative: Limited types of PDFs can be modeled. Even approximations are slow. Online learning not always easy. Local learning not always easy.
4.2.1 Variables
The inference that we need to make from a video is usually posed as a variable estimation problem.
Since the variable representing the inference is not directly observed, it is called a hidden variable in
the terminology of probabilistic graphical models [59]. There are other hidden variables in the system
that help simplify the relation between the observed variables and the hidden variable that needs to be
inferred. For example, if we want to infer whether a person entered or exited the scene, the pixels of the
different frames of the video form the observed variables, and the variable representing the “enter” or “exit”
event is the hidden variable to be inferred. In this case, the variable representing the position of
the person in a given frame is an example of a hidden variable that we do not really care about
by itself. However, this variable helps define the relation between the pixels and the “enter/exit” event
variable in a more structured manner. The estimation of the event variable only needs to be based on the
temporal sequence of the position variable, while the position variable can be directly related to the
observed pixel variables. In other words, the event variable can be conditionally independent of the
pixel variables, given the position variable.
In probabilistic graphical models (or generative models), these variables form nodes of the graph.
On the other hand, in modular design, the output of modules can be thought of as the hidden variables.
This can help us think of both the design paradigms from a common viewpoint, which will help us
develop the V/M graph framework to design vision systems.
Generative models, especially those using probabilistic graphical models, work with probability
distributions of the variables instead of their single values. On the other hand, in modular systems,
modules usually output single values of various variables. One can think of these single values as
Dirac-delta probability distributions or probability mass functions of these variables. This will further
help us put the ideas from both design paradigms together for the V/M graph framework.
4.2.2 Constraints on variables
Although the ways in which constraints on variables are formulated in the two paradigms seem
very different, they can be viewed from a common viewpoint. As we shall see, the overall knowledge of
these constraints is usually made up of a collection of subconstraints, where each subconstraint
is defined only on a subset of the (observed and/or hidden) variables.
In modular design, the working of a processing module defines the output for a given input; in
other words, it constrains the output variable based on the input variables. This can be thought of as a
constraint on the joint value of the input and output variables. Another point worth noting is that a module
usually puts constraints on only a subset of the variables, where the subset is defined by the variables
forming the input and output of the module.
On the other hand, constraints in probabilistic graphical models are defined as joint probability
density functions between different variables. The graphical structure takes advantage of the fact that
the net joint density over all the variables can often be expressed as a product of different joint
density functions, where each of these function terms in the product takes only a subset of the variables
as its arguments.
Thus, in both the paradigms, the net constraint on the joint space of observed and hidden variables
is expressed as a collection of subconstraints, where each of these subconstraints is defined only on a
subset of all the variables. This gives us the common ground to view the expression of constraints in
both paradigms from the same viewpoint after viewing the variables in the same setting in Section 4.2.1.
4.2.2.1 Modeling constraints as probabilities
Building upon the idea of constraints on variables, one can think of the subconstraints put by a
module as a joint probability distribution between the input and the output variables. An approximation to
the actual distribution can theoretically be made by extensive sampling of inputs and their corresponding
output values from the module. Each of the modules puts one or more of these subconstraints on
the distribution of these variables. The coexistence of these subconstraints occurs in an AND fashion,
making it similar to the product form used in probabilistic graphical models.
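The sampling idea above can be sketched as follows. A hypothetical thresholding module (an illustrative stand-in, not a module from the thesis) is sampled repeatedly to build an empirical joint histogram over its binned input and its output; the histogram then acts as a soft constraint that puts no mass on input/output pairs the module never produces.

```python
# Approximating a module's input/output constraint by an empirical joint
# distribution, built from extensive sampling.
import numpy as np

rng = np.random.default_rng(2)

def module(x):
    """A hypothetical thresholding module: input in [0, 1) -> binary output."""
    return int(x > 0.5)

# Sample inputs, run the module, and histogram the (input bin, output) pairs.
n_bins = 10
joint = np.zeros((n_bins, 2))
for _ in range(10000):
    x = rng.random()
    y = module(x)
    joint[int(x * n_bins), y] += 1
joint /= joint.sum()

# The empirical joint encodes the module's constraint: pairs the module
# never produces receive (almost) no probability mass.
print(joint[0])   # low inputs: mass concentrated on output 0
print(joint[9])   # high inputs: mass concentrated on output 1
```

In a product-form model, such a table would enter as one factor among many, combined multiplicatively (in an AND fashion) with the other subconstraints.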
Thus, our learning goal would be to estimate the parameters of this joint distribution that models the
probability of the observed data points by optimizing some cost function. The most common cost function
used is the likelihood of the data itself, and the estimation process is known as maximum likelihood
estimation (ML estimation).
The goal of the inference process is to estimate the (statistics/distribution of the) hidden variables of interest, given the parameters and the observed variables. Thus, we need to know the parameters to infer the hidden variables. On the other hand, we can estimate the parameters easily if we know the hidden variables associated with the data (observed variables). In other words, the inference problem is the dual of the learning problem, and their solutions are mutually dependent. The EM algorithm [35] iterates over two steps, the E step and the M step, refining the estimates of the hidden variables and of the parameters in turn.
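This alternation can be sketched concretely. The following is a minimal Python illustration (not from the original text) of EM for a toy one-dimensional, two-component Gaussian mixture with unit variances and equal priors: the E step computes responsibilities (soft estimates of the hidden component labels), and the M step re-estimates the means from those responsibilities.

```python
import math
import random

def em_1d_gmm(data, mu, iters=50):
    """Toy EM for a 1-D two-component Gaussian mixture with unit variances
    and equal priors: alternate responsibility (E) and mean (M) updates."""
    mu = list(mu)
    for _ in range(iters):
        # E step: posterior probability that each point came from component k.
        resp = []
        for x in data:
            w = [math.exp(-0.5 * (x - m) ** 2) for m in mu]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M step: update each mean as the responsibility-weighted average.
        for k in range(2):
            num = sum(r[k] * x for r, x in zip(resp, data))
            den = sum(r[k] for r in resp)
            mu[k] = num / den
    return mu

random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(200)] + \
       [random.gauss(3.0, 1.0) for _ in range(200)]
means = sorted(em_1d_gmm(data, mu=[-1.0, 1.0]))
print(means)  # converges near the true means, -2 and 3
```

The mutual dependence described above is visible in the code: the E step needs the current means, and the M step needs the current responsibilities.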
In V/M graphs, we will use the graphical structure of probabilistic graphical models to model the structure of dependencies and constraints on the variables, while we will use the processing of modules as approximations to some of the functions for parameter learning and for marginalization over probability distributions during inference, in order to speed up the process. An added advantage is that we can naturally use the more complex and task-specific constraints that some modules can define.
4.3 Probability Density Modeling
In the previous section, we established how variables, and constraints on variables, in modular and generative design can be viewed from the same perspective. This requires modeling constraints on variables as probability density functions. The advantages of doing so include the following:
1. Probabilistic modeling provides a way to deal with uncertainties. Video processing is notoriously afflicted by uncertain data. For example, edges become weak in low light or under saturation, or disappear behind occluding surfaces. Inferences cannot be made without taking different sources of information into account. In such a scenario, probabilistic modeling provides a way to remain noncommittal, postponing a decision until more information can be incorporated.
2. Probability density functions can be broken into simpler subfunctions using different density modeling techniques. Such a property is useful when the net constraint on the variables can be expressed as a combination of subconstraints, as is often the case.
3. Bayesian modeling provides a mathematically sound feedback mechanism. While dealing with variables at different levels, feedback from higher-level variables can help disambiguate the inference at lower levels, since higher-level variables are usually inferred using more information than a lower-level variable. Bayesian modeling provides us with the mathematical tools that allow bidirectional flow of information. In a loose sense, by "lower-level" variables we mean the variables that are closely related to the observed variables, or raw data, in a graphical structure of conditional dependencies.
4. It gives access to well-understood algorithmic tools. Message-passing algorithms for inference, such as the sum-product algorithm [33] and belief propagation [60], are available for probabilistic models. Similarly, well-understood learning algorithms such as the EM algorithm [35] and its variants [61] are available for probabilistic models.
While modeling a vision system, generative modeling provides a principled way to model the data under probability theory. A generative model is usually formulated to explain the entire data in the form of a cost function that represents the data likelihood. Using such cost functions in a generative setting allows us to account for all the information present in the data, so we do not have to worry about the information loss usually associated with feature extraction modules. Any information loss can instead be quantified in the assumptions made on the various probability functions. However, an accurate model of the entire data in interesting video-understanding scenarios is usually mathematically and computationally prohibitive. Thus, it becomes necessary to make approximations during modeling, inference, and learning. For example, when a camera is fixed, it is reasonable to assume that changes in human appearance will be due to changes in joint angles; it is prohibitive, however, to model all the joints of the human body and match their angles against the appearance. Approximate models can instead represent the changing human appearance as a linear combination of eigen-bases of appearance maps, or model only a subset of (salient) joint angles of the human body. For many practical applications, such modeling approximations produce acceptable results. However, these simpler models may be inaccurate and incomplete for more difficult high-level tasks such as multi-object tracking with partial or brief occlusions.
On the other hand, researchers using modular approaches have already improved the performance of individual modules in isolation. For example, background subtraction [19, 56] and contour tracking in simple scenarios using Kalman filters [58] or particle filters [62] have reached a certain level of maturity. These modules are ready sources of inspiration, and of functional approximations, for modeling some of the constraints that can be placed on the joint-probability space of observed and hidden variables. So, while the intuition of mutual conditional independence can be coded in graphical form, as done in probabilistic graphical models, some of the constraints between mutually dependent variables can be formulated using the modules employed in non-generative modular vision systems. This, as we shall see in Chapter 5, is the key idea behind V/M graphs.
The goal of a generative model is to fit a plausible probability model over the joint space of ob-
served and hidden variables such that it best explains the observed data. Modules that in effect constrain
the joint-probability space can be combined in different ways as subconstraints on that space. The two most well-studied ways to combine simpler functions or subconstraints to model complex probability densities are the mixture form and the product form. In the following, we investigate where and why the product form is better suited than the mixture form for combining subconstraints.
4.3.1 Mixture form
In the mixture form, a PDF is modeled by a weighted sum of different functions. These functions
are commonly known as mixture components and the weights are known as mixing coefficients. The
mixing coefficients are determined by the fraction of the total density represented by the corresponding component. As an example, the density p(x) can be represented as a sum of n mixture densities p_i(x) (where i ranges from 1 to n) with mixing coefficients ω_i, as shown in Equation (4.1):
p(x) = \sum_{i=1}^{n} \omega_i \, p_i(x)    (4.1)
It should be noted that if the p_i(x) represent valid probability densities, then for p(x) to be a valid probability density, the mixing coefficients must satisfy Equation (4.2):
\sum_{i=1}^{n} \omega_i = 1    (4.2)
A mixture form is useful for modeling a PDF with a complex shape if its support (probability space) can be broken into regions where the shape of the PDF can be approximated by a simple function. In such a scenario, the mixture components act as function approximators in their respective regions. In the ensuing discussion, the adjectives simple and complex, applied to functions, refer to their mathematical or computational tractability and to the number of parameters needed to define them. A common functional form for the mixture components is the Gaussian, leading to mixture-of-Gaussians modeling [19].
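As an illustration (a Python sketch, not part of the original text), Equations (4.1) and (4.2) can be checked numerically for a two-component Gaussian mixture: the mixing coefficients sum to 1, and the resulting mixture integrates to approximately 1 over a wide interval.

```python
import math

def gaussian(x, mu, sigma):
    """Normalized one-dimensional Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture(x, weights, params):
    """Equation (4.1): p(x) = sum_i w_i p_i(x); Equation (4.2) requires the
    weights to sum to 1 for p(x) to be a valid density."""
    return sum(w * gaussian(x, mu, s) for w, (mu, s) in zip(weights, params))

weights = [0.3, 0.7]                 # illustrative mixing coefficients
params = [(-1.0, 0.5), (2.0, 1.0)]   # illustrative (mean, sigma) pairs
assert abs(sum(weights) - 1.0) < 1e-12

# Riemann-sum check that the mixture integrates to (approximately) 1.
dx = 0.01
total = sum(mixture(i * dx, weights, params) * dx for i in range(-1000, 1001))
print(total)  # close to 1.0
```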
Mixture modeling can also take the form of a classification problem, where one wants to determine the underlying distribution of a set of data points under the assumption that each data point is drawn according to one of the simpler component densities. The mixing coefficients form the prior probabilities of the components, while the class label of a data point is the component it is supposedly drawn from. The EM algorithm has been used extensively in the literature [35] to determine the parameters of mixture models in an unsupervised manner. However, putting subconstraints together is different from the classification problem, and as we shall see, the product form may be better suited to this task.
4.3.2 Product form
Approximating a probability density using a product form involves combining subconstraints or distribution functions by multiplying them together. Such a model is also known as a product of experts [34]. A product of experts combines different distributions, or "experts," by multiplying them and renormalizing the result to arrive at a joint distribution. For example, if a joint distribution over a set of five variables x_1 through x_5 can be expressed using four "experts" A, B, C, and D, such a combination would be described by Equation (4.3):
p(x_1, x_2, x_3, x_4, x_5) \propto p_A(x_1, x_2, x_3, x_4, x_5) \times p_B(x_1, x_2, x_3, x_4, x_5) \times p_C(x_1, x_2, x_3, x_4, x_5) \times p_D(x_1, x_2, x_3, x_4, x_5)    (4.3)
In Equation (4.3), p(·) is the joint-probability distribution of the five variables, and p_A(·), p_B(·), p_C(·), and p_D(·) are the factor probability distributions (or "opinions") given by the four experts. We are neglecting the normalization needed so that each term in Equation (4.3) is a valid probability distribution that integrates to 1.
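The multiply-and-renormalize operation can be sketched numerically (in Python; the two "experts" below are hypothetical toy functions, not from the thesis). Each expert is vague on its own, but their product concentrates probability where both agree.

```python
def product_of_experts(experts, grid):
    """Combine expert densities by pointwise multiplication over a discrete
    grid, then renormalize (Equation (4.3)); the normalizing sum plays the
    role of the neglected normalization constant."""
    joint = [1.0] * len(grid)
    for expert in experts:
        joint = [j * expert(x) for j, x in zip(joint, grid)]
    z = sum(joint)
    return [j / z for j in joint]

# Two hypothetical experts over a 1-D grid: each only rules out one side.
grid = [i * 0.1 for i in range(-30, 31)]
expert_a = lambda x: 1.0 if x > -1.0 else 0.01   # "x is not too small"
expert_b = lambda x: 1.0 if x < 1.0 else 0.01    # "x is not too large"

posterior = product_of_experts([expert_a, expert_b], grid)
mode = grid[posterior.index(max(posterior))]
print(mode)  # lies in the region both experts agree on, (-1, 1)
```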
While modeling a probability distribution over a high-dimensional space using a product form, the term contributed by each expert need not involve all the variables. This makes it more efficient to model distributions over high-dimensional spaces, as individual experts can specialize over small subsets of
variables. Indeed, as we shall see later, this property is used by probabilistic graphical models, such as
Bayesian networks and factor graphs, in order to simplify modeling of the joint-distribution.
The individual experts need to agree on the correct solution (the high probability region of the joint-
distribution). However, each expert is allowed to make mistakes that falsely allot high probability in
certain regions, which should actually be low probability regions of the joint space. This will work as
long as the regions where each expert makes mistakes do not coincide. In other words, all the experts
should not waste their probability distribution on the same low probability region; instead, they should
correct each other’s mistakes. Such an expression is suitable when different experts are looking at
different aspects of a complex task. Modules that perform each of these subtasks will represent different
experts.
Care must be taken, however, while designing the output of the experts. If an expert is wrongly
overconfident that a region should have a low probability, no matter how much the other experts try to
raise the probability of the region, they might never be able to offset the negative opinion of a single
overconfident expert. One way to alleviate this problem is to increase the entropy of individual experts
slightly by adding small uniform distribution terms to their outputs (and renormalizing). This ensures
that every region of the space is assigned a nonzero probability, no matter how small. As we shall see
later, it also has mathematical advantages.
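The effect of this smoothing can be shown numerically (a Python illustration with hypothetical toy experts, not from the thesis): without the small uniform term, a single zero-probability "veto" annihilates the whole product; with it, the product remains well defined everywhere.

```python
def smooth(expert, grid, eps=0.01):
    """Mix a small uniform distribution into an expert's output and
    renormalize, so that no region is assigned exactly zero probability."""
    vals = [expert(x) for x in grid]
    z = sum(vals)
    n = len(grid)
    return [(1 - eps) * v / z + eps / n for v in vals]

grid = list(range(10))
# A hypothetical overconfident expert: zero everywhere except x = 3.
overconfident = lambda x: 1.0 if x == 3 else 0.0
# Another expert that favors x = 7 just as rigidly.
other = lambda x: 1.0 if x == 7 else 0.0

raw = [overconfident(x) * other(x) for x in grid]
print(sum(raw))  # 0.0: the raw product vetoes every state outright

a = smooth(overconfident, grid)
b = smooth(other, grid)
joint = [p * q for p, q in zip(a, b)]
print(sum(joint) > 0)  # True: smoothing keeps the product well defined
```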
4.3.3 Probabilistic graphical models with product form
Probabilistic graphical models make use of the property of the product form that a component function (of the factorizable joint PDF) can take a subset of the variables as its arguments instead of the whole set. The subsets of variables occurring together as arguments of a given function define a structure of conditional dependencies among the variables. Under certain conditions, one can change from one graphical model to another while encoding the same set of conditional dependencies [59]. Here we revisit two closely related graphical models: the factor graph and the Bayesian network. The V/M graph can be viewed as a generalization of these two graphical models.
4.3.3.1 Factor graphs
A factor (function) in a product term can selectively look at a subset of dimensions, leaving the dimensions outside that subset for other factors to constrain. In other words, only a subset of the variables may be part of the constraint space of a given expert. This leads to the graph structure of a factor graph, where an edge between a factor-function node and a variable node exists only if the variable appears as one of the arguments of the factor function. This also establishes an equivalence between factor graphs and the product of experts. An example of this equivalence is shown in Equation (4.4):
p(x_1, x_2, x_3, x_4, x_5) \propto p_A(x_1, x_2, x_3, x_4, x_5) \times p_B(x_1, x_2, x_3, x_4, x_5) \times p_C(x_1, x_2, x_3, x_4, x_5) \times p_D(x_1, x_2, x_3, x_4, x_5)
                           \propto f_A(x_1, x_2) \times f_B(x_2, x_3) \times f_C(x_1, x_3) \times f_D(x_3, x_4, x_5)    (4.4)
In Equation (4.4), f_A(x_1, x_2), f_B(x_2, x_3), f_C(x_1, x_3), and f_D(x_3, x_4, x_5) are the factor functions of the factor graph. As we can see, all we have to do is add dummy variables to each of the functions to establish equivalence to the corresponding expert in the product-of-experts form. In this example, function f_A is equivalent to expert p_A, function f_B is equivalent to expert p_B, and so on. Now, we can freely borrow ideas from both types of models. The factor graph expressed mathematically in Equation (4.4) can be depicted graphically as shown in Figure 4.1.
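The factorization of Equation (4.4) can be written down directly in code. The sketch below (Python, with illustrative binary variables and made-up factor values) represents a factor graph as a list of (variable-subset, function) pairs and evaluates the unnormalized product, brute-forcing the partition function for this small example.

```python
import itertools

# A factor graph as (variable-subset, function) pairs, mirroring the scopes
# f_A(x1,x2), f_B(x2,x3), f_C(x1,x3), f_D(x3,x4,x5) of Equation (4.4).
factors = [
    (("x1", "x2"),       lambda x1, x2: 0.9 if x1 == x2 else 0.1),
    (("x2", "x3"),       lambda x2, x3: 0.8 if x2 == x3 else 0.2),
    (("x1", "x3"),       lambda x1, x3: 0.7 if x1 == x3 else 0.3),
    (("x3", "x4", "x5"), lambda x3, x4, x5: 0.6 if x3 == x4 == x5 else 0.4),
]
variables = ["x1", "x2", "x3", "x4", "x5"]

def unnormalized_joint(assignment):
    """Product of local factors; each factor sees only its own subset."""
    p = 1.0
    for scope, f in factors:
        p *= f(*(assignment[v] for v in scope))
    return p

# Brute-force the partition function over all 2^5 assignments.
z = sum(unnormalized_joint(dict(zip(variables, vals)))
        for vals in itertools.product([0, 1], repeat=5))
print(z > 0)  # True; dividing by z yields a valid joint distribution
```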
Inference in factor graphs can be performed using a local message-passing algorithm called the sum-product algorithm [33]. The algorithm reduces the exponential complexity of calculating the probability distribution over all the variables to more manageable local calculations at the variable and function
Figure 4.1 Factor graph for the example given in the text.
nodes. The local calculations depend only on the incoming messages from the nodes adjacent to the
node at hand (and the local function, in case of function nodes). The messages are actually distributions
over the variables involved. For a graph without cycles, the algorithm converges once messages have passed from one end of the graph to the other and back. For many applications, even when the graph has loops, the messages converge within a few iterations of message passing. Turbo codes in signal processing make use of this convergence property of loopy propagation [63]. Message passing is clearly a principled form of feedback, or information exchange, between modules. We will make use of a variant of message passing in our new framework, because exact message passing is not feasible for complex vision systems.
In the message-passing algorithm, there are three main types of calculation. The first is the message from a variable node to an adjacent function node: it is obtained by multiplying all the incoming messages to that variable node from its other adjacent function nodes (that is, excluding the function node we are sending the message to) and normalizing the product. The second is the message from a function node to a variable node: the local function at the function node is multiplied by all the incoming messages from the adjacent variable nodes (except the one we are sending the message to), and the product is marginalized over all variables except that of the target variable node. The third is the local belief at a variable node, which amounts to multiplying all the incoming messages at that node and normalizing the product. All these messages have to be appropriately normalized so that they represent probability distributions (that integrate to 1).
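The three calculations can be seen in action on a tiny chain-structured factor graph (a Python sketch, not from the thesis; the prior and factor values are illustrative). Since the chain is a tree, the belief computed from local messages must match the brute-force marginal.

```python
# Chain factor graph: unary prior on x1, pairwise fA(x1, x2), pairwise
# fB(x2, x3); all variables binary, all numbers illustrative.
prior_x1 = [0.8, 0.2]
fA = {(a, b): 0.9 if a == b else 0.1 for a in (0, 1) for b in (0, 1)}
fB = {(b, c): 0.7 if b == c else 0.3 for b in (0, 1) for c in (0, 1)}

def normalize(m):
    z = sum(m)
    return [v / z for v in m]

# Type 1 (variable to function): x1 forwards its only other incoming
# message, the prior.
msg_x1_to_fA = normalize(prior_x1)
# Type 2 (function to variable): multiply incoming messages with the local
# function and marginalize out every variable except the target.
msg_fA_to_x2 = normalize([sum(msg_x1_to_fA[a] * fA[(a, b)] for a in (0, 1))
                          for b in (0, 1)])
msg_fB_to_x2 = normalize([sum(fB[(b, c)] for c in (0, 1)) for b in (0, 1)])
# Type 3 (local belief): product of all incoming messages at x2.
belief_x2 = normalize([msg_fA_to_x2[b] * msg_fB_to_x2[b] for b in (0, 1)])

# Brute-force marginal of x2 for comparison; on a tree they must agree.
raw = [sum(prior_x1[a] * fA[(a, b)] * fB[(b, c)]
           for a in (0, 1) for c in (0, 1)) for b in (0, 1)]
marginal_x2 = normalize(raw)
print(belief_x2)
print(marginal_x2)  # identical to the belief, up to rounding
```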
4.3.3.2 Bayesian networks
A Bayesian network is a directed graph of nodes. Each node represents a variable, and directed edges represent dependence relations among the variables. If there is a directed edge from a node A to another node B, then we say that A is a parent of B, and B is a child of A. In other words, the variable at the head of the edge is conditionally dependent on the variable at the tail. The joint probability of all the variables is expressed as the product of their conditional probabilities given their parents.
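For a concrete (illustrative, Python) example of this factorization, consider a three-node chain A → B → C of binary variables; because every factor is a proper conditional distribution given its parents, the product automatically sums to 1, with no separate normalization constant.

```python
# Three-node Bayesian network A -> B -> C over binary variables: the joint
# factorizes as p(A) p(B|A) p(C|B). All numbers are illustrative.
pA = {0: 0.6, 1: 0.4}
pB_given_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key (b, a)
pC_given_B = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key (c, b)

def joint(a, b, c):
    """Joint probability as the product of each node's conditional given
    its parents."""
    return pA[a] * pB_given_A[(b, a)] * pC_given_B[(c, b)]

# Each factor sums to 1 over its own variable, so the joint sums to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # approximately 1.0
```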
A Bayesian network can be converted into a factor graph [59]. Thus, ideas from Bayesian networks are also applicable to factor graphs with appropriate modifications. For example, the inference algorithm known as belief propagation for Bayesian networks (also known as belief networks [60]) is an instance of the sum-product algorithm used for factor graphs [33].
Results on learning parameters [41, 42] and structure [64] in Bayesian networks can also be adapted to factor graphs. Since we concentrate only on learning parameters in this work, and not the structure, we shall see how the gradient-descent-based parameter learning algorithms [41, 42] are actually another instance of the online EM algorithm, and are applicable to V/M graphs.
4.3.4 Discussion
A cascade of modules can also be thought of as a special case of the product form, where each module in the cascade is a factor of the product and puts constraints only on its input and output variables. We have established the characteristics of the mixture form and the product form, and have also seen how the product form lends itself to the factorization that underlies probabilistic graphical models. It is now easy to further justify the advantages of the product form over the mixture form.
4.3.4.1 Advantages of product form
It can be shown that the product form has some definite advantages over the mixture form. The mixture form is suited to partitioning the data space into different simple high-probability subregions, each represented by a component of the mixture. When the components themselves have shapes very different from these subregions, a large number of components may be required. Moreover, if-then-else-type conditional gating is much easier to obtain using a product form than a mixture form. Certain useful learning algorithms, such as the EM algorithm [35], operate on log probabilities or log likelihoods, and log forms decompose into much simpler sums when a product of probabilities is involved. Effective variable partitioning, dimension partitioning, and gating are very cumbersome with mixture forms. Furthermore, by adding small uniform distributions to the factors of a product form (and renormalizing), one can obtain the function of a mixture form as well. The converse, that is, expressing a product form as a mixture form, is not easy. Thus, there are clear advantages to using the product form for modeling the joint distribution over observed and hidden variables.
To get a more intuitive feel for the expressive power of the product form, consider a set of functions p_i(x) indexed by i, where i ranges from 1 to n. As expressed in Equation (4.1), if these functions are probability distributions, a new distribution can be expressed as a mixture of these functions by taking a weighted sum, where each weight is non-negative and the weights sum to 1. Now, let us make some assumptions about the probability space and the functions that we are dealing with. Let us assume that we are dealing with a finite space, that is, the variable x is defined on a closed set X, such that Equation (4.5) holds for some finite positive constant k (0 < k < ∞):
\int_{x \in X} 1 \, dx = k    (4.5)
Let us also assume that all the functions in question are zero outside this set X, and that the functions are highly discriminative, that is, they have low entropy. Such an assumption is not unreasonable: when dealing with high-dimensional data, the underlying structure in the data occupies a much smaller subspace. This means that the data points will be sparsely distributed, and therefore functions representing the data population will be highly discriminative, with strong peaks.
Such functions will be close to zero over most of the space (since they cannot be negative and must integrate to 1).
Now, let us modify the functions a little by adding a small constant α_i to each function p_i. By small we mean that α_i ≪ max_x p_i(x). The resulting function can be renormalized to integrate to 1 over the set X. Essentially, we are creating a weighted mixture of a uniform distribution over X and the original function p_i. For a small enough weight on the uniform term, this should not affect the data-modeling capabilities of p_i. Here, the assumption is that we are interested in accurately capturing the relative probabilities of the high-probability regions, rather than worrying about the low-probability regions. As long as a low-probability region remains sufficiently unlikely to be hit by random sampling, for example, it does not matter whether our model assigns it a probability of 5ε² instead of the correct cumulative probability ε (as ε → 0).
Now, let us consider a product of such modified functions. Equation (4.6) expresses such a product as a new function p(x), neglecting for the time being the normalization needed to yield valid probability distributions:
p(x) \propto \prod_i (\alpha_i + p_i(x))
     = \prod_i \alpha_i + \Big(\prod_i \alpha_i\Big) \sum_j \frac{p_j(x)}{\alpha_j} + \Big(\prod_i \alpha_i\Big) \sum_j \sum_{k>j} \frac{p_j(x)\, p_k(x)}{\alpha_j \alpha_k} + \ldots + \prod_j p_j(x)    (4.6)
In Equation (4.6), we can safely neglect the higher-order terms, from the third term onwards (in the last line), provided one condition is satisfied. If the functions p_i have nonoverlapping high-probability regions, and if in their low-probability regions the functions are so low as to be negligible in comparison to the constants α_i (that is, p_i(x) ≪ α_i over most of the space), then the third and subsequent terms fade in comparison to the second term, leaving us with the first two terms. Renormalizing the product then leaves us with a uniform term (the first term) and a sum of the functions p_i with different weighting factors related to the α_i. Thus, for slightly modified functions (by the addition of a constant) and under certain conditions on these functions, we can obtain a mixture form by taking a product. This is further illustrated graphically in Figure 4.2. In this figure, we choose the p_i to be Gaussian functions, so the modified functions α_i + p_i(x) are uni-Gauss functions (the sum of a uniform and a Gaussian). We show how closely
the sum and the product match on the high-probability region when the peaks of the original functions
are nonoverlapping.
Figure 4.2 Product and mixture of uni-Gauss functions coincide over compact supports when the peaks are far apart: solid and dotted lines are the original one-dimensional uni-Gauss functions (the x-axis represents the free variable and the y-axis the function value), while dashed and dash-dotted lines are the normalized mixture (sum) and product, respectively. All functions are defined on the range -3 to 4 only (and can be assumed zero outside it).
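The behavior in Figure 4.2 can be reproduced numerically. The sketch below (Python, with illustrative peak locations and α values, not taken from the thesis) builds two uni-Gauss functions on the support [-3, 4] with well-separated peaks and compares the normalized mixture against the normalized product at one of the peaks.

```python
import math

# Discretize the compact support [-3, 4].
LO, HI, N = -3.0, 4.0, 7000
DX = (HI - LO) / N
GRID = [LO + (i + 0.5) * DX for i in range(N)]

def uni_gauss(mu, alpha=0.05):
    """A unit-variance Gaussian plus a small uniform term alpha,
    renormalized to integrate to 1 over the support."""
    vals = [math.exp(-0.5 * (x - mu) ** 2) + alpha for x in GRID]
    z = sum(vals) * DX
    return [v / z for v in vals]

def normalize(vals):
    z = sum(vals) * DX
    return [v / z for v in vals]

p1, p2 = uni_gauss(-1.5), uni_gauss(2.5)   # well-separated peaks
mixture = normalize([a + b for a, b in zip(p1, p2)])
product = normalize([a * b for a, b in zip(p1, p2)])

# Compare the combined densities at the grid point nearest the first peak.
i_peak = min(range(N), key=lambda i: abs(GRID[i] + 1.5))
print(mixture[i_peak], product[i_peak])  # close in the high-probability region
```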
When the high-probability regions of the original functions coincide to some extent, such claims about representing mixtures through products by adding small uniform terms cannot be guaranteed. Yet, we can find constants that, when added to the original functions, produce fairly good results. Figure 4.3 shows this for two functions whose peaks are not far apart. For the sake of the figure, the constants were found by trial and error. However, in such cases mixture modeling is also not an elegant way to model the data, as we become increasingly unsure of which mixture component a data point belongs to. On the other hand, as we shall see later, this is where the product form comes in handy, with its different "experts" modeling different dimensions or aspects of the data. Thus, in our opinion, the product form is much more versatile than the mixture form for probabilistic modeling that breaks the original function down into simpler components.
Figure 4.3 Product and mixture of uni-Gauss functions coincide over compact supports even when the peaks are fairly close: solid and dotted lines are the original one-dimensional uni-Gauss functions (the x-axis represents the free variable and the y-axis the function value), while dashed and dash-dotted lines are the normalized mixture (sum) and product, respectively. All functions are defined on the range -3 to 4 only (and can be assumed zero outside it).
4.3.4.2 Limitations of probabilistic graphical models
Despite all their advantages, probabilistic graphical models have limitations. As listed in Table 4.1, the main limitations arise from the calculations needed to estimate probability densities. For example, local message-passing algorithms for inference, such as the sum-product algorithm [33], require two kinds of calculation: products of probability distributions, and marginalization. Of the two, marginalization is in general difficult to perform, since it involves integration (or summation) over a subset of the argument space. Even while calculating the product, in order for the result to be a valid distribution, it has to be normalized so that it integrates to 1. The normalization constant is known as the partition function, and computing it again requires integration over the entire argument space.
Approximations for many operations in graphical models have been developed over the years. Statistical sampling methods, such as various forms of MCMC [38] for performing integration, or particle filtering [62] for propagating distributions over time, have been developed and applied to generative models. Variational methods [65], which approximate the overall cost function through a surrogate distribution, have also been developed.
Sampling methods are usually very slow, as they require a large number of samples to approximate complex distributions. Variational methods, on the other hand, are limited by the form and complexity of the surrogate distribution, since these determine its capacity to model the actual distribution.
CHAPTER 5
VARIABLE/MODULE GRAPHS: FACTOR GRAPHS WITH MODULES
With an understanding of the strengths and limitations of modular and generative design, we are
now in a position to develop a hybrid framework for designing modular vision systems. In this new framework, which we call variable/module graphs or V/M graphs [55], we aim to borrow the strengths of
both modular and generative design. From the generative models in general and probabilistic graphical
models in particular, we want to keep the principled way to explain all the information available and the
relations between different variables using a graphical structure. From the modular design, we want to
borrow ideas for local and fast processing of information available to a given module as well as online
adaptation of model parameters.
5.1 Replacing Functions in Factor Graphs with Modules
In Section 4.2, we showed how both design frameworks can be viewed from the same perspective: the identification of variables and the design of constraints on the joint-probability space. From that perspective, modules in modular design constrain the joint-probability space of observed and hidden variables just as the factor functions in factor graphs do. However, there are crucial differences. Without loss of generality, we will continue our discussion of graphical models in terms of factor graphs, since many other graphical models can be converted to factor graphs.
Modules in modular design take (probability distributions of) various variables as inputs and produce (probability distributions of) variables as outputs. Producing an output can be thought of as passing a message from the module to the output variable. This is comparable to part of the message-passing
algorithm in factor graphs, that is, passing a message from the function node to a variable node. This
calculation is done by multiplying messages from all the other variable nodes (except the one that we
are sending the message to) to the factor function at the function node and marginalizing the product
over all the other variables (except the one that we are sending the message to). Processing of a module
can be thought of as an approximation to this calculation.
However, the notion of a variable node does not exist in modular design. Let us, for a moment, imagine that modules are not connected to each other directly. Instead, imagine that every connection from the output of one module to the input of another is replaced by a node connected to the output of the first module and the input of the second. This node represents the output variable of the first module, which is the same as the input variable of the second module. Let us call this the variable node, much as we would in the case of factor graphs.
In other words, a cascade of modules in a modular system is nothing but a cascade of approximators to function nodes (separated by variable nodes, of course). If we generalize this notion of interconnection of modules, or module nodes, via variable nodes, we get a graph structure. Such a graph is bipartite in much the same way as a factor graph. We shall call such a graph a V/M graph. To put it another way, if we replace the function nodes in a factor graph by modules, we get a V/M graph: a bipartite graph in which the variables form one set of nodes (called variable nodes) and the modules form the other set (called module nodes).
5.2 System Modeling Using V/M Graphs and Its Relation to the Product Form
A factor graph is a graphical representation of the factorization that a product form represents. Since the V/M graph can be thought of as a generalization of the factor graph, what does this mean for the application of the product form to the V/M graph? In essence, we are still modeling the overall constraints on the joint-probability distribution using a product form. However, the rules of message passing have been relaxed, which makes the process an approximation to the exact product form. We compare this
approximation to the well-known variational methods [65] of approximation in generative models in
Sections 5.4 and 5.5.
To see how we are still modeling the joint distribution over the variables using a product form, let us start by analyzing the role of modules. A module takes the value of its input variable(s) x_i and produces a probability distribution over its output variable(s) x_j. This is nothing but the conditional distribution over the output variables given the input variables, p(x_j | x_i). Thus, each module is an instantiation of such a conditional density function.
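This module-as-conditional view can be sketched as follows (a hypothetical Python module, not from the original system): the module exposes only the ability to sample its output given an input value, which is exactly what an instantiation of p(x_j | x_i) must provide even without a closed-form density.

```python
import random

class NoisyEstimatorModule:
    """A hypothetical processing module viewed as a conditional density
    p(x_out | x_in): given an input value it returns samples of its output,
    but exposes no closed-form density, acting as a black box."""
    def __init__(self, noise=0.5):
        self.noise = noise
    def sample_output(self, x_in, n=1000):
        # Stand-in processing: the output is the input corrupted by noise.
        return [x_in + random.gauss(0.0, self.noise) for _ in range(n)]

random.seed(1)
module = NoisyEstimatorModule()
samples = module.sample_output(x_in=3.0)
mean = sum(samples) / len(samples)
print(mean)  # the sampled output distribution concentrates near the input
```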
In a Bayesian network, similar conditional probability distributions are defined, with an arrow representing the direction of causality. This makes it simple to define a module as a single edge, or a set of edges, going from the input to the output, converting the whole V/M graph into a Bayesian network, which is another graphical representation of the product form, as described in Section 4.3.3. Also, since a Bayesian network can always be converted into a factor graph [59], we can convert a V/M graph into a factor graph. However, processing modules are often arranged in a bottom-up fashion, whereas the flow of causality in a Bayesian network is top-down. This is not a problem, since we can use Bayes' rule to reverse the flow of causality. Once we have established a module as the equivalent of a conditional density, manipulation of the structure is easy, and it always remains within the purview of product-form modeling of the joint distribution. However, the similarity between V/M graphs and probabilistic graphical models ends here on the theoretical level. As we shall see in Section 5.3, the inference mechanisms applied in practice to graphical models are not applied in exactly the same manner to V/M graphs. One reason is that modules do not produce a functional form of the conditional density; they only provide a black box from which we can sample output (distributions) for given sample points of the input, and not the other way around. Thus, in practice, applying Bayes' rule to change the direction of causality is not as easy as it is in theory. We use comodules, at times, for the flow of messages in the other direction through a given module. This makes the V/M graph an approximation to an equivalent Bayesian network or, at best, a highly loopy graphical model.
5.3 Inference
In a factor graph, calculating the messages from variable nodes to function nodes, or the belief at
each variable node, is usually not difficult. When the incoming messages are in a nonparametric form,
any kind of resampling algorithm or nonparametric belief propagation [66] can be used. What is more
difficult is the integration or summation associated with the marginalization needed to calculate the
message from a function node to a variable node. Another difficulty is the complexity of designing
the local function at a function node. Since we also need to calculate the
messages using products and marginalization (or summation), we need to devise functions that model the
subconstraint as well as lend themselves to easy and efficient marginalization (or an approximation thereof).
If one were to break a function down into more subfunctions, there is a trade-off between
network complexity and function complexity for a manageable system.
This is where we can make use of the modules developed for other systems. The output of a module
can be viewed as the result of a marginalization operation used to calculate the message sent to the output variable.
The question then arises as to what we can say about the message sent to the input variable. If we really
cannot modify the module to send a message to what was the input variable in the original module,
we can view it as passing a uniform message (distribution) to the input variable. To save computation,
this message can be discounted entirely during calculations that require combining it
with other messages. However, in this framework, we encourage modifying existing modules to pass
information backwards as well. One way to do this is to associate with the module a comodule that does
the reverse of the module's processing. For example, if a module takes in a background
mask and outputs a probability map of the position of a human in the frame, the comodule will provide
a probability map of pixels belonging to the background or foreground (human) given the position of
the human.
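A toy sketch of such a module/comodule pair follows. The frame size, blob model, and constants are all illustrative assumptions; a real module would use a learned background model and a person-sized matched filter:

```python
import numpy as np

H, W = 24, 32  # hypothetical frame size

def module_mask_to_position(mask):
    """Forward module: foreground mask -> probability map over the
    person's position (here simply a normalized copy of the mask)."""
    p = mask.astype(float) + 1e-6          # avoid an all-zero map
    return p / p.sum()

def comodule_position_to_mask(pos, size=3.0):
    """Comodule: given a position, produce a per-pixel probability of
    belonging to the foreground (an isotropic blob around the person)."""
    ys, xs = np.mgrid[0:H, 0:W]
    d2 = (ys - pos[0]) ** 2 + (xs - pos[1]) ** 2
    return np.exp(-d2 / (2.0 * size ** 2))  # values in (0, 1] per pixel

mask = np.zeros((H, W))
mask[10:14, 8:11] = 1.0                       # a person-like foreground blob
pos_map = module_mask_to_position(mask)       # message toward the position
fg_prob = comodule_position_to_mask((12, 9))  # message back toward the mask
```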
Let us now see what we gain by introducing modified modules as approximations to functions and
their message calculation procedures. Essentially, we get computationally cheap approximations to complex
marginalization operations over functions that would be difficult to perform from first principles or by
statistical sampling, the approach used with generative models until now. Whether this kind of message
passing will converge, even for graphs without cycles, remains an open theoretical question; however, we
have found the results to be convincing for the applications for which we implemented it, as shown in
Chapter 6.
5.4 Learning
There are a few issues that we would like to address while designing learning algorithms for complex
vision systems. The first is that when the data and system complexity are prohibitive for batch
learning, we would like designs that lend themselves to online learning. The second
major issue is the need for a learning scheme that can be divided into steps performed
locally at different module or function nodes. This makes sense, since the parameters of a module are
usually local to the module. Especially in an online learning scheme, the parameters should depend only
on the local module and the local messages incident on the function node.
We shall derive learning methods for V/M graphs based on those for probabilistic graphical models.
Although methods for structure learning in graphical models have been explored [64, 67], we will
limit ourselves for the time being to parameter learning; structure learning is suggested as future
work in Section 7.2. In line with our stated goals in the paragraph above, we will consider online and
local parameter learning algorithms for probabilistic graphical models [41, 42] while deriving learning
algorithms for V/M graphs.
Essentially, parameter adjustment is done as gradient ascent over the log likelihood of the given
data under the model. When formulating the gradient ascent over the cost function, due to the factorization
of the joint-probability distribution, the derivative of the cost function decomposes into a sum of terms,
where each term pertains to a local function. A similar idea can be extended to our modified factor graphs,
or V/M graphs. However, the mathematics may not be straightforward because of the approximations
made to the factorization.
Now, we shall derive a gradient-ascent-based algorithm for parameter adjustment for V/M graphs.
Our goal is to find the model parameters that maximize the data likelihood p(D), which is a standard
goal in the literature [35, 41], since the (observed) data are what we have and seek to explain, while the rest
of the (hidden) variables just aid in modeling the data. Each module will be represented by a conditional
density function p_{ω_i}(x_i | N_i). Here, x_i represents the output variable of the ith module, N_i represents the
set of input variables to the ith module, and ω_i represents the parameters associated with the module. We
will make the assumption that data points are independently and identically distributed (iid), which means
that for data points d_j (where j ranges from 1 to m, the number of data points) and the data likelihood
p(D), Equation (5.1) holds:
p(D) = \prod_{j=1}^{m} p(d_j) \quad (5.1)
In principle, we can choose any monotonically increasing function of the likelihood, and we choose
the \ln(\cdot) function to convert the product into a sum. This means that for the log likelihood, Equation
(5.2) holds:
\ln p(D) = \sum_{j=1}^{m} \ln p(d_j) \quad (5.2)
Therefore, when we maximize the log likelihood with respect to the parameters ω_i, we can concentrate
on maximizing the log likelihood of each data point by gradient ascent, adding these gradients
together to get the complete gradient of the log likelihood over the entire data. Thus, at each step we
need to deal with only one data point and accumulate the result as we get more data points. This is
significant in developing online algorithms that deal with limited (one) data point(s) at a time. In the case
where we tune the parameters slowly, this is in essence like a running average with a forgetting factor.
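The effect of slow online tuning can be illustrated with a one-parameter example. The Bernoulli model, learning rate, and data stream below are our own illustrative choices, not part of the systems described in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stream of coin flips from a Bernoulli(0.7) source, seen one at a time.
data = rng.random(5000) < 0.7

theta = 0.5   # parameter tuned online, one data point at a time
lr = 0.005    # small learning rate: slow tuning acts like a running
              # average with a forgetting factor

for d in data:
    # gradient of ln p(d | theta) for a Bernoulli model
    grad = 1.0 / theta if d else -1.0 / (1.0 - theta)
    theta = float(np.clip(theta + lr * grad, 0.01, 0.99))

print(round(theta, 2))  # hovers near the true parameter 0.7
```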
Now, taking the partial derivative of the log likelihood of one data point d_j with respect to a parameter
ω_i, we get Equation (5.3). Since we will get p(d_j | x_i, N_i) as a result of message passing, and we
will get p(x_i | N_i) as the output of the processing module, all these computations can be done locally at
module i itself. The probability densities p(d_j) and p(N_i) are nonnegative functions that only scale
the gradient computation and do not change the direction of the gradient. With V/M graphs, where we do not
even expect to calculate the exact gradient, we will only try to do a generalized gradient ascent by moving
in the direction of positive gradient. It suffices that, as an approximate greedy algorithm, we move in the
general direction of increasing p(x_i | N_i) and hope that p(d_j | x_i, N_i), which is a marginalization of the
product of p(x_k | N_k) over many k's, will follow an increasing pattern as we spread the procedure over
many k's (modules). The greedy algorithm should be slow enough in gradient ascent that it can capture
the trend over many j's (data points) when run online:
\frac{\partial \ln p(d_j)}{\partial \omega_i}
= \frac{\frac{\partial}{\partial \omega_i} p(d_j)}{p(d_j)}
= \frac{\frac{\partial}{\partial \omega_i} \left( \int_{x_i, N_i} p(d_j \mid x_i, N_i)\, p(x_i, N_i)\, dx_i\, dN_i \right)}{p(d_j)}
= \frac{\frac{\partial}{\partial \omega_i} \left( \int_{x_i, N_i} p(d_j \mid x_i, N_i)\, p(x_i \mid N_i)\, p(N_i)\, dx_i\, dN_i \right)}{p(d_j)}
= \frac{\int_{x_i, N_i} \frac{\partial}{\partial \omega_i} \left( p(d_j \mid x_i, N_i)\, p(x_i \mid N_i)\, p(N_i) \right) dx_i\, dN_i}{p(d_j)}
= \frac{\int_{x_i, N_i} p(N_i)\, \frac{\partial}{\partial \omega_i} \left( p(d_j \mid x_i, N_i)\, p(x_i \mid N_i) \right) dx_i\, dN_i}{p(d_j)}
\quad (5.3)
This sketches the general insight behind the learning algorithm. The sketch is in line with a similar
derivation for Bayesian network parameter estimation in [41], where the scenario is much better defined
than the scenario defined here for V/M graphs. In Section 5.5.2, we provide another viewpoint to justify
the same steps.
5.5 Free-Energy View of EM Algorithm and V/M Graphs
For generative models, the EM algorithm [35] and its online, variational, and other approximations
have been used as the learning algorithms of choice. Online methods work by maintaining, at every step,
sufficient statistics for the q-function that approximates the probability distribution p of the hidden and
observed variables. We use a free-energy view of the EM algorithm [61] to justify a way of designing
learning algorithms for our new framework. In [61], the online or incremental version of the EM algorithm
was justified using a distributed E-step. We extend this view to justify local learning at different module
nodes. Being equivalent to a variational approximation to the factor graph means that some of the concepts
applicable to generative models, such as the variational and online EM algorithms, are applicable to
the V/M graphs. We use this insight to compare inference and learning in V/M graphs to the free-energy
view of the EM algorithm [61].
5.5.1 Online and local E-step
Let us assume that X represents the sequence of observed variables x_i, and Y represents the sequence
of hidden variables y_i. So, we are modeling the generative process p(x_i | y_i, θ), with some prior
p(y_i) on y_i, given system parameters θ (which are the same for all pairs (x_i, y_i)). Due to the Markovian
assumption of x_i being conditionally independent of x_j given Y, when i ≠ j, we get Equation (5.4):
p(X \mid Y, \theta) = \prod_i p(x_i \mid y_i, \theta) \quad (5.4)
We would like to maximize the log likelihood of the observed data X. The EM algorithm does this by
alternating between an E-step, shown in Equation (5.5), and an M-step, shown in Equation (5.6), in each
iteration with iteration number t:
\text{E-step (compute distribution): } q^{(t)}(y) = p(y \mid x, \theta^{(t-1)}) \quad (5.5)

\text{M-step (compute arg max): } \theta^{(t)} = \arg\max_{\theta} E_{q^{(t)}}\left[\log p(x, y \mid \theta)\right] \quad (5.6)
Going by the free-energy view of the EM algorithm [61], the E- and M-steps can be viewed as
alternately maximizing the free energy with respect to the q-function and the parameters θ.
This is related to the minimization of free energy in statistical physics. The formulation of the free energy
F is given in Equation (5.7):
F(q, \theta) = E_q\left[\log p(x, y \mid \theta)\right] + H(q) = -D(q \,\|\, p_\theta) + L(\theta) \quad (5.7)
In Equation (5.7), D(q‖p) represents the Kullback-Leibler divergence (KL-divergence) between q
and p, given by Equation (5.8), and L(θ) represents the data likelihood for the parameter θ. In other
words, the EM algorithm alternates between minimizing the KL-divergence between q and p, and maximizing
the likelihood of the data given the parameter θ:
D(q \,\|\, p) = \int_y q(y) \log \frac{q(y)}{p(y)}\, dy \quad (5.8)
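For discrete distributions, the integral in Equation (5.8) becomes a sum, which can be computed directly (a minimal sketch with made-up example distributions):

```python
import numpy as np

def kl_divergence(q, p):
    """Discrete KL-divergence D(q || p) = sum_y q(y) log(q(y) / p(y))."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0                     # the convention 0 * log 0 = 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, p))                     # 0.0: vanishes when q = p
print(kl_divergence([0.8, 0.1, 0.1], p) > 0)   # True: positive otherwise
```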
The equivalence of the regular form of EM and the free-energy form of EM has already been established
in [61]. Further, since the y_i's are independent of each other, the q(y) and p(y) terms can be split into
products of the individual q(y_i)'s and p(y_i)'s, respectively. This is used to justify the incremental version
of the EM algorithm, which incrementally runs partial or generalized M-steps on each data point. This can
also be done using sufficient statistics of the data collected up to that data point, if it is possible to define
sufficient statistics for a sequence of data points.
Coming back to the message passing algorithm: for each data point, when message passing converges,
the beliefs at each variable node give a distribution over all the hidden variables. The
q-function is nothing but an approximation of the actual distribution over the variables, p, and we
are trying to minimize the KL-divergence between the two. Now, we can get the same q-function from
the converged messages and beliefs in the graphical model. Hence, one can view message passing as a
localized and online version of the E-step.
5.5.2 Online and local M-step
Now, let us have a look at the M-step. The M-step involves maximizing the likelihood with respect to
the parameters θ. When performed online for a particular data point, it can be thought of as a stochastic
gradient ascent version of Equation (5.6). Making use of the sufficient statistics will improve
the approximation of the M-step, since it uses all the data presented up to that point instead of
a single data point. Now, if we take the factorization property of the joint-probability function into
account, we can also see that the M-step can be distributed locally over each component of the parameter
θ associated with each module or function node. This justifies the localized parameter updates based
on gradient ascent shown in [41, 42]. This is another critical insight that allows the online
learning algorithms devised for various modules to be used as local M-steps in our systems. Due to the
integration involved in the marginalization over the hidden variables while calculating the likelihood,
this will be an approximation of the exact M-step. Determining the conditions under which this approximation
should work will be part of our future work.
One issue that still remains is the partition function. With all the local M-steps maximizing one
term of the likelihood in a distributed fashion, it is likely that the local terms increase without bound, while
the actual likelihood does not. This problem arises when insufficient care is taken to normalize the
likelihood by dividing it by a partition function. When dealing with sampling-based numerical integration
methods such as MCMC [38], it becomes difficult to calculate the partition function. This is
because methods such as importance sampling and Gibbs sampling used in MCMC deal with a surrogate
q-function, which is usually a constant multiple of the target q-function. The multiplication factor can
be assessed by integrating over the entire space, which is difficult. There are two ways of getting around
this problem. One way, suggested in [34], is to maximize the contrastive divergence instead of the
actual divergence. The other is to put some kind of local normalization in place while calculating
the messages sent out by various modules. As long as the multiplication factor of the q-function does
not increase beyond a fixed number, we can guarantee that maximizing the local approximations of the
components of the likelihood function will actually improve system performance.
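The local-normalization safeguard is trivial to implement: every outgoing message is renormalized to a proper distribution, so the unknown scale never grows (a minimal sketch; the function name and values are illustrative):

```python
import numpy as np

def normalized_message(raw_scores):
    """Renormalize an outgoing message so it is a proper distribution.
    This keeps the multiplicative factor of the q-function bounded, so
    distributed local M-steps cannot inflate likelihood terms without
    bound."""
    raw = np.asarray(raw_scores, dtype=float)
    return raw / raw.sum()

m1 = normalized_message([3.0, 6.0, 1.0])
m2 = normalized_message([30.0, 60.0, 10.0])  # same shape, 10x the scale
```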
In the M-step of the EM algorithm, we maximize Q(θ, θ^{(i−1)}) with respect to θ, as described in
Equation (5.9):
\text{M-step: } \theta^{(i)} \leftarrow \arg\max_{\theta} Q(\theta, \theta^{(i-1)}) \quad (5.9)
In Equation (5.10), we show how this maximization can be distributed over different components of
the parameter variable θ:
Q(\theta, \theta^{(i-1)}) = E\left[\log p(X, Y \mid \theta) \mid X, \theta^{(i-1)}\right]
= \int_{h \in H} \log p(X, Y \mid \theta)\, f(Y \mid X, \theta^{(i-1)})\, dh
= \int_{h \in H} \left( \sum_{i=1}^{m} \log p(x_i, y_i \mid \theta_i) \right) f(Y \mid X, \theta^{(i-1)})\, dh
= \sum_{i=1}^{m} \int_{h \in H} \log p(x_i, y_i \mid \theta_i)\, f(Y \mid X, \theta^{(i-1)})\, dh
\quad (5.10)
In Equation (5.10), part of the generative model is represented by the function f(Y | X, θ), which
describes the distribution of the hidden variable set Y given the observation set X and the parameter
vector θ. Also, the observation set and the hidden variable set are broken into pairs (x_i, y_i),
which justifies distributing the M-step over these pairs.
5.5.3 PDF softening
Until now, PDF softening has only been justified intuitively [34]. In this section, we revisit the intuition
and justify the concept mathematically in Equation (5.11):
D(q \,\|\, p)
= \int_{x \in X} q(x) \log \frac{q(x)}{p(x)}\, dx
= \int_{x \in X} q(x) \log q(x)\, dx - \int_{y \in X} q(y) \log p(y)\, dy
= \int_{x \in X} q(x) \log \frac{\prod_i q_i(x)}{\int_{w \in X} \prod_j q_j(w)\, dw}\, dx - \int_{y \in X} q(y) \log p(y)\, dy
= \int_{x \in X} q(x) \left( \sum_i \log q_i(x) - \log \int_{w \in X} \prod_j q_j(w)\, dw \right) dx - \int_{y \in X} q(y) \log p(y)\, dy
= \int_{x \in X} \left( \sum_i q(x) \log q_i(x) - q(x) \log \int_{w \in X} \prod_j q_j(w)\, dw \right) dx - \int_{y \in X} q(y) \log p(y)\, dy
= \sum_i \left( \int_{x \in X} q(x) \log q_i(x)\, dx \right) - \int_{z \in X} q(z) \log \left( \int_{w \in X} \prod_j q_j(w)\, dw \right) dz - \int_{y \in X} q(y) \log p(y)\, dy
= \sum_i \left( \int_{x \in X} q(x) \log q_i(x)\, dx \right) - \log \left( \int_{w \in X} \prod_j q_j(w)\, dw \right) \int_{z \in X} q(z)\, dz - \int_{y \in X} q(y) \log p(y)\, dy
= \sum_i \left( \int_{x \in X} q(x) \log q_i(x)\, dx \right) - \log \left( \int_{w \in X} \prod_j q_j(w)\, dw \right) - \int_{y \in X} q(y) \log p(y)\, dy
\quad (5.11)
As shown in Equation (5.11), if we want to decrease the KL-divergence between the surrogate
distribution q and the actual distribution p, we need to minimize the sum of three terms. The first term
on the last line of the equation is minimized when the high-probability regions defined by q are
low-probability regions of the individual components q_i. This means that this term prefers diversity
among the different q_i's, since q is proportional to the product of the q_i's; the
low-probability regions of q need not be low-probability regions of a given q_i. On the other hand,
the third term is minimized if the high-probability regions defined by q overlap with the high-probability
regions defined by p, and the low-probability regions defined by q overlap with
the low-probability regions defined by p. In other words, the surrogate distribution q should closely model
the actual distribution p.
Hence, overall, the model seeks a good fit in the product, while seeking diversity in the individual terms
of the product. It also seeks to have the not-so-high-probability regions of individual q_i's overlap with the high-probability
regions of q. When p has a peaky (low-entropy) structure, these goals may seem conflicting.
However, this problem can be alleviated if the individual experts cater to different dimensions or aspects
of the probability space, while each individual distribution has high enough entropy. This justifies softening
the PDFs, which can be done by mixing in a high-entropy distribution such as the uniform distribution
(which provably has the highest entropy), by raising the distribution to a fractional power, or by raising
the variance of its peaks. Intuitively, this means that we want to strike a balance between the useful opinion
expressed by an expert and being overcommitted to any particular solution (high-probability region).
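Two of the softening operations just mentioned can be sketched directly; each raises the entropy of a peaky distribution (the example distribution and constants are illustrative choices):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p[p > 0] * np.log(p[p > 0])))

def soften_mix(p, eps=0.1):
    """Mix in a uniform distribution (the maximum-entropy choice)."""
    p = np.asarray(p, dtype=float)
    return (1.0 - eps) * p + eps / len(p)

def soften_power(p, t=0.5):
    """Raise to a fractional power and renormalize."""
    q = np.asarray(p, dtype=float) ** t
    return q / q.sum()

peaky = np.array([0.94, 0.02, 0.02, 0.02])  # low-entropy expert opinion
print(entropy(soften_mix(peaky)) > entropy(peaky))    # True
print(entropy(soften_power(peaky)) > entropy(peaky))  # True
```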
5.6 Prescription
With the discussion of the theoretical justification of the design of V/M graphs complete, in this
section we summarize how to design a V/M graph for a given application. In Chapter 6, we
will present experimental results of the successful design of vision systems for complex tasks using V/M
graphs.
To design a V/M graph for an application, we follow these guidelines:
1. Identify the variables needed to represent the solution.
2. Identify the intermediate hidden variables.
3. Suitably break down the data into a set of observed variables.
4. Identify the processing modules that can relate and constrain different variables.
5. Ensure that there is enough diversity in the processing modules.
6. Lay down the graphical structure of the V/M graph similar to how one would do that for a factor
graph, using modules instead of function nodes.
7. Redesign each module so that it can tune its parameters online to increase the local joint-probability
function.
8. Ensure that the modules have enough variance or leniency to be able to recover from mistakes,
based on the redundancy provided by the presence of other modules in the graphical structure.
9. If a module provides no feedback to a variable node, this can be treated as feedback equivalent
to a uniform distribution. Such feedback can be dropped from the calculation of local messages to save
computation.
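Guidelines 6 and 9, and the product combination of messages at a variable node, can be sketched as a minimal V/M-graph skeleton. This is a toy discrete version; the class, names, and distributions are all illustrative, not the systems built in Chapter 6:

```python
import numpy as np

class VMGraph:
    """Toy V/M-graph skeleton: each module sends a message (a
    distribution) to a variable node, and the belief at a variable is
    the normalized product of its incoming messages.  A module with no
    feedback would send a uniform message, which can simply be omitted
    (guideline 9)."""

    def __init__(self, n_states):
        self.n_states = n_states
        self.messages = {}            # (module_name, variable) -> message

    def send(self, module_name, variable, message):
        m = np.asarray(message, dtype=float)
        self.messages[(module_name, variable)] = m / m.sum()

    def belief(self, variable):
        b = np.ones(self.n_states)
        for (_, v), msg in self.messages.items():
            if v == variable:
                b *= msg
        return b / b.sum()

g = VMGraph(n_states=3)
g.send("appearance", "position", [0.2, 0.6, 0.2])   # one module's opinion
g.send("motion", "position", [0.1, 0.8, 0.1])       # a diverse second opinion
b = g.belief("position")
print(b.argmax())  # 1: both modules favor the middle state
```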
Once the system has been designed, the processing will follow a simple message passing algorithm
while each module learns in a local and online manner. If the results are not desirable, one would
want to replace some of the modules with better estimators for the given task, or make the graph more robust
by adding more (and diverse) modules, while considering making the modules more lenient.
CHAPTER 6
SOME APPLICATIONS OF V/M GRAPHS
The V/M graph framework was developed in Chapters 4 and 5 to help design fast online learning
applications for accomplishing complex video processing and understanding tasks. In this chapter, we
report design and experimental results of several applications related to the broad problem of automated
surveillance.
Vision systems for automated surveillance have evoked a lot of interest due to the increase in security
concerns all over the world and the availability of superior computing power and cheap cameras [68–70].
Of particular interest is automatic event detection in surveillance video. An event is a high-level semantic
concept and is not easy to define in terms of low-level raw data. This gap between the available
data and the useful high-level concepts is known as the semantic gap. It can safely be said that vision
systems, in general, aim to bridge the semantic gap in visual data processing.
Variables representing high-level concepts such as events can be conveniently defined over lower-level
variables such as the positions of people in a frame, provided that the defining lower-level variables
are reliably available. For example, if we were to decide whether a person came out or went in through
a door, we could do so easily if the sequence of positions of the person (and the position of the
door) across the frames of the scene were available to us. This is the rationale behind modular design,
where in this case one would devise a system for person tracking, and the output of the tracking module
would be used by an event detection module to decide whether the event has taken place.
6.1 Scenario
The scenario that we considered for our experiments is related to the broad problem of automated
surveillance. We assume a fixed camera in our experiments, but such an assumption is not a necessity
for the application of V/M graphs, since it is a generic framework to design a vision system based on
interaction between various processing modules. We also consider the scene to be staged indoors, and
again, this has no bearing on the utility of V/M graphs for vision systems that can deal with outdoor
scenes.
In the following experiments, we concentrate on several applications of V/M graphs in the surveillance
setting. We will roughly proceed from simpler tasks to increasingly complex tasks, often building
incrementally upon previously accomplished subtasks. This will also
showcase one of the advantages of V/M graphs, namely, easy extensibility.
6.2 Application: Person Tracking
We start with the most basic experiment, where we build an application for tracking a single target
(person) using a fixed indoor camera. In this application, we identify five variables that affect inference
in a frame: the intensity map (pixel values) of the frame (i.e., the observed variable(s)), the background
mask, the position of the person in the current frame, the position of the person in the previous frame, and
the velocity of the person in the previous frame. These variables are represented as x_1, x_2, x_3, x_4, and
x_5, respectively, in Figure 6.1. The variables exchange information through modules F_A, F_B, F_C, and
F_D. Module F_A represents the background subtraction module that maintains an eigen-background
model [56] as system parameters, using a modified version of the online learning algorithm for
principal component analysis (PCA) described in [71]. While it passes information from x_1 to x_2, it
does not pass it the other way round, as the image intensities are evidence and hence fixed. Module F_C serves as
the interface between the background mask and the position of the person. In effect, we run an elliptical
Gaussian filter, roughly the size of a person/target, over the background map and normalize its output
as a map of the probability of the person's position. Module F_B serves as the interface between the image
intensities and the position of the person in the current frame, x_3. Since it is computationally expensive
to perform operations at every pixel location, we sample only a small set of positions to confirm whether the
image intensities around each position resemble the appearance of the person being tracked. The module
maintains an online-learning version of the eigen-appearance of the person as system parameters, based on
a modification of previous work [72]. It also does not pass any message to x_1. The position of the
person in the current frame depends on the position of the person in the previous frame, x_4, and the
velocity of the person in the previous frame, x_5. Assuming a first-order motion model, which is encoded
in F_D as a Kalman filter, we connect x_3 to x_4 and x_5. Both x_4 and x_5 are assumed fixed for the current
frame; therefore, F_D only passes the message forward to x_3 and does not pass any message to x_4 or x_5.
Figure 6.1 V/M graph for the single-target tracking application.
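Module F_C's filtering step can be sketched as follows. The kernel sizes, the toy foreground blob, and the FFT-based convolution are our own illustrative choices, not the code used in this work:

```python
import numpy as np

def elliptical_gaussian_kernel(kh, kw, sigma_y, sigma_x):
    """Elliptical Gaussian roughly the size of a person (taller than wide)."""
    ys = np.arange(kh) - (kh - 1) / 2.0
    xs = np.arange(kw) - (kw - 1) / 2.0
    k = np.exp(-(ys[:, None] ** 2) / (2 * sigma_y ** 2)
               - (xs[None, :] ** 2) / (2 * sigma_x ** 2))
    return k / k.sum()

def position_probability(fg_map, kernel):
    """Filter the background (foreground) map with the kernel and
    normalize the response into a probability map over positions."""
    H, W = fg_map.shape
    kh, kw = kernel.shape
    s = (H + kh, W + kw)             # zero-padded linear convolution via FFT
    resp = np.fft.irfft2(np.fft.rfft2(fg_map, s) * np.fft.rfft2(kernel, s), s=s)
    resp = resp[kh // 2: kh // 2 + H, kw // 2: kw // 2 + W]
    resp = np.clip(resp, 0.0, None) + 1e-12   # clamp FFT round-off negatives
    return resp / resp.sum()

fg = np.zeros((40, 60))
fg[10:24, 20:27] = 1.0               # a person-sized foreground blob
pmap = position_probability(fg, elliptical_gaussian_kernel(15, 7, 5.0, 2.0))
peak = tuple(int(v) for v in np.unravel_index(pmap.argmax(), pmap.shape))
print(peak)  # near the center of the blob
```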
6.2.1 Message passing and learning schedule
The message passing and learning schedule used was as follows:
1. Initialize a background model.
2. If a large contiguous foreground area is detected, initialize a person-detection module F_C and
tracking-related modules F_B and F_D.
3. Initialize the position of the person in the previous frame as the most likely position according to
the background map.
4. Initialize the velocity of the person in the previous frame to be zero.
For every frame, the following will occur:
1. Propagate a message from x_1 to F_A as the image.
2. Propagate a message from x_1 to F_B as the image.
3. Propagate messages from x_4 and x_5 to F_D.
4. Propagate a message from F_D to x_3 in the form of samples of likely positions.
5. Propagate a message from F_A to x_2 in the form of a background probability map after eigen-background
subtraction.
6. Propagate a message from x_2 to F_C in the form of a background probability map.
7. Propagate a message from F_C to x_3 in the form of a probability map of likely positions of the object
after filtering x_2 with an elliptical Gaussian filter.
8. Propagate a message from x_3 to F_B in the form of samples of likely positions.
9. Propagate a message from F_B to x_3 in the form of probabilities at the samples of likely positions, as
defined by the eigen-appearance of the person maintained at F_B.
10. Combine the incoming messages from F_B, F_C, and F_D at x_3 as the product of the probabilities
at the samples generated by F_D.
11. Infer the highest-probability sample as the new object position measurement. Calculate the current
velocity.
12. Update the online eigen models at F_A and F_B.
13. Update the motion model at F_D.
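The combination of the three messages at x_3 over the sampled positions (steps 4 through 11) can be sketched as follows; the candidate positions and message values below are made up purely for illustration:

```python
import numpy as np

def combine_at_x3(samples, p_motion, p_background, p_appearance):
    """Combine the messages from F_D, F_C, and F_B at x_3 as a product
    over the candidate position samples, then infer the best sample."""
    combined = p_motion * p_background * p_appearance
    combined = combined / combined.sum()
    return samples[combined.argmax()], combined

# Five candidate positions, sampled around the prediction from F_D (step 4)
samples = np.array([[10, 10], [12, 11], [14, 12], [20, 5], [30, 30]])
p_motion     = np.array([0.30, 0.40, 0.20, 0.05, 0.05])  # from F_D (step 4)
p_background = np.array([0.10, 0.50, 0.30, 0.05, 0.05])  # from F_C (step 7)
p_appearance = np.array([0.20, 0.45, 0.25, 0.05, 0.05])  # from F_B (step 9)

best, posterior = combine_at_x3(samples, p_motion, p_background, p_appearance)
print(best)  # [12 11]: the sample all three modules favor
```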
6.2.2 Results
We ran our person tracker in both single-person and multiperson scenarios on grayscale indoor
sequences of 320 × 240 pixels from a fixed camera. People appeared as small as 7 × 30
pixels. It should be noted that no elaborate initialization and no prior training was done; the tracker
was required to run and learn on the job, fresh out of the box. The only prior information used was the
approximate size of the target, which was used to initialize the elliptical filter. Some of the successful
results on difficult sequences are shown in Figure 6.2. Running unoptimized MATLAB code on a
2.4 GHz computer, the system performs at about 2 frames per second.
(a) Brief occlusion of the object.
(b) Nearly camouflaged object.
Figure 6.2 Results of the tracking application (white rectangles around targets), zoomed in and cropped for clarity.
The tracker could easily track people successfully after complete but brief occlusions, owing to the
integration of background subtraction, eigen-appearance, and motion models. The system successfully
picks up and tracks a new person automatically when he/she enters the scene, and gracefully purges the
tracker when the person is no longer visible. As long as a person is distinct from the background for
some time during a sequence of frames, the online adaptive eigen-appearance model successfully tracks
the person even when they are subsequently camouflaged by the background. Note that any of the
tracking components in isolation would fail in difficult scenarios such as complete occlusion, widely
varying appearance of people, and background camouflage.
The tracker was not a complete success: it lost track of the object in rather difficult situations,
where the target went into occlusion while uncovering a background object that matched not only the grey level but
also the shape of the target being tracked. Such a tracking failure is shown in Figure 6.3.
Figure 6.3 Sequence showing failure in a difficult situation.
To alleviate the problem of losing track because of occlusion coupled with background
objects matching in appearance, we changed our model to include more information. Specifically, we used color
frames instead of grayscale frames. The V/M graph remains the same, as shown in Figure 6.1. The
tracking results improved tremendously, and are shown in Figure 6.4. In this figure, the trajectory of
the center of the bounding box (from previous frames) is plotted in green, and the bounding box in the
current frame is shown as a white rectangle. Note that some trajectories pass behind objects and are still
not lost.
Even the improved tracker was not perfect. When we tried yet another difficult scenario, where
the target suddenly changes velocity by a large amount (the person starts running after walking slowly),
the tracker loses track of the object and initializes a new track where the object appears in the next
frame. We think that the motion model (first-order linear) cannot take into account large
accelerations. This can be alleviated by a more sophisticated motion model. The failure results are
shown in Figure 6.5.
Figure 6.4 Different successful tracking sequences after using color information.
Figure 6.5 Even the color tracker can lose track under large acceleration of the object.
6.3 Application: Multiperson Tracking
To adapt the single-person tracker developed in Section 6.2 to multiple targets, we need to modify
the V/M graph depicted in Figure 6.1. In particular, we need at least one position variable for each
target being tracked. We also need, for each object, one variable representing the position in the previous
frame and one representing the velocity in the previous frame. On the module side, for each object we
need one module each for appearance matching, elliptical filtering of the
background map, and Kalman filtering. The resulting V/M graph is shown in Figure 6.6. The message
passing and learning schedule was essentially the same as that given in Section 6.2.1, except that the
target-specific steps were performed for each target being tracked.
Figure 6.6 V/M graph for the multiple-target tracking application (here, two targets).
6.3.1 Results
We ran our person tracker to track multiple persons in grayscale indoor sequences of 320 × 240
pixels from a fixed camera. People appeared as small as 7 × 30 pixels. It should be noted
that no elaborate initialization and no prior training was done; the tracker was required to run and learn
on the job, fresh out of the box. The results are shown in Figure 6.7. Running unoptimized MATLAB
code on a 2.4 GHz computer, the system performs at about 2 frames per second.
Figure 6.7 Results of the multitarget tracking application (white rectangles around targets), zoomed in and cropped for clarity.
We also modified the tracker to take into account the color information. The tracking results im-
proved, and the results are shown in Figure 6.8.
Figure 6.8 Different successful tracking sequences involving multiple targets and using color information.
The multiperson tracker was unable to deal with situations where two persons walked together
and one fully or partially occluded the other throughout. In such cases, often only one
person would be tracked successfully. An example of this failure is shown in Figure 6.9. This problem
can be solved by incorporating explicit occlusion modeling into the generative model underlying the V/M
graph.
Figure 6.9 Failure due to one person occluding the other.
6.4 Application: Trajectory Modeling and Prediction
A tracking system can be an essential part of a trajectory modeling system. Many interesting events
in a surveillance scenario can be recognized from trajectories: people walking into restricted areas,
violations at access-controlled doors, and movement against the general flow of traffic are a few
examples of events that can be extracted through trajectory analysis. This allows us to detect unusual
events that correspond to unusual trajectories. With this framework, it is easy to incrementally build a
trajectory modeling system on top of a tracking system, with interactive feedback from the trajectory
models improving the tracking results.
6.4.1 Trajectory modeling module
We add a trajectory modeling module F_E connected to x_3 and x_4, which represent the position of
the object being tracked in the current frame and the previous frame, respectively. The V/M graph of
the extended system is shown in Figure 6.10.
Figure 6.10 V/M graph for the trajectory modeling system.
The trajectory modeling module stores the trajectories of the people, and predicts the next position
of the object based on previously stored trajectories. The message passed from F_E to x_3 is given in
Equation (6.1):

    p_traj ∝ α + Σ_i w_i x_i^pred        (6.1)

In Equation (6.1), p_traj is the message passed from F_E to x_3, α is a constant added as a uniform
distribution, i is an index that runs over the stored trajectories, w_i is a weight based on how close
the stored trajectory is to the position and direction of the current motion, and x_i^pred is the point
that follows, on trajectory i, the stored point closest to the object's position in the previous frame.
The predicted trajectory is represented by the variable x_6.
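The message in Equation (6.1) can be sketched as follows. The Gaussian form of the closeness weights and of the bumps placed at each predicted point are illustrative stand-ins (the actual module also weights by the direction of motion, which this sketch omits); only the overall structure, a uniform floor α plus a weighted sum over predicted points x_i^pred, follows the equation.

```python
import numpy as np

def predict_from_trajectories(prev_pos, trajectories, sigma=20.0):
    """For each stored trajectory, find its point closest to the previous
    object position, take the following point as x_i^pred, and weight it by
    closeness (a Gaussian stand-in for the thesis's weighting rule)."""
    weights, preds = [], []
    for traj in trajectories:
        traj = np.asarray(traj, dtype=float)   # shape (T, 2)
        # exclude the last point so a successor always exists
        d = np.linalg.norm(traj[:-1] - prev_pos, axis=1)
        j = int(np.argmin(d))
        weights.append(np.exp(-d[j] ** 2 / (2.0 * sigma ** 2)))
        preds.append(traj[j + 1])              # next point on that trajectory
    return np.array(weights), np.array(preds)

def message_to_x3(x, weights, preds, alpha=0.05, sigma=5.0):
    """Evaluate p_traj(x) ∝ α + Σ_i w_i k(x, x_i^pred), with Gaussian bumps
    k standing in for how each prediction spreads probability mass."""
    diffs = np.linalg.norm(preds - np.asarray(x, dtype=float), axis=1)
    return alpha + float(np.sum(weights * np.exp(-diffs ** 2 / (2.0 * sigma ** 2))))
```

The constant α keeps the message nonzero everywhere, so a target that follows no stored trajectory is not suppressed outright.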
6.4.2 Results
This is a very simple trajectory-modeling module, and the values of various constants were set
empirically, although no elaborate tweaking was necessary. As shown in Figure 6.11, we can predict
the most probable trajectory in many cases where similar trajectories have been seen before.
Figure 6.11 Sequences showing successful trajectory modeling. The object trajectory is shown in green, and the predicted trajectory is shown in blue.
There are times when the target starts out close to a particular stored trajectory but then moves onto
a different path. Our system can correct its prediction (or stop making a wrong prediction) in this
changing scenario. Successful results are shown in Figure 6.12.
Figure 6.12 Sequences showing successful adaptation of trajectory prediction. Each row represents one sequence.
The experiments are very preliminary at this stage, and more data will be needed before behavior
monitoring can be attempted. Also, the trajectory modeling module is rudimentary: it stores all the
trajectories and compares the position of the object in the previous frame with every trajectory in
memory. For a system that runs for a long time and collects many trajectories, such a modeling and
matching scheme is certainly not computationally efficient. Other approaches to trajectory modeling,
such as vector quantization [15], could replace the trajectory modeling module in this framework.
6.5 Application: Event Detection Based on Single Target
The ultimate goal for automated video surveillance is to be able to do automatic event detection in
video. With trajectory analysis, we move closer to this goal, since there are many events of interest that
can be detected using trajectories. In this section, we present an application that detects whether
a person went in or came out of a secure door. To design this application, all we have to do is add an
event detection module connected to the trajectory variable node, and an event variable node connected
to the event detection module. The event detection module works according to simple rules based on the
target trajectory.
We show the V/M graph used for this application in Figure 6.13. The event detection module applies
simple rules to the trajectory to decide whether the person came out or went in. Specifically, it
checks the direction of the vector from the start point of the trajectory to its end point, and divides
the direction space into two sets to make the decision. The decision is made only when the track is
lost, not while the object is still being tracked. Thus, the event variable has three states: “no
event,” “came out,” and “went in.”
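A sketch of this decision rule follows. The convention that a positive y-component of the net motion means “went in” is an assumption adopted here for concreteness; the thesis specifies only that the direction space is split into two sets and that the decision is made once the track is lost.

```python
import math

# Illustrative sketch of the single-track door rule; the door axis
# (positive y = into the building) is an assumed convention.

def classify_door_event(trajectory, track_lost):
    """Return one of the event variable's three states. A decision is made
    only when the track is lost; until then the state is "no event"."""
    if not track_lost or len(trajectory) < 2:
        return "no event"
    (x0, y0), (x1, y1) = trajectory[0], trajectory[-1]
    # Direction of the vector from the trajectory's start point to its end point.
    angle = math.atan2(y1 - y0, x1 - x0)
    # Split the direction space into two halves around the assumed door axis.
    return "went in" if 0.0 < angle < math.pi else "came out"
```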
Figure 6.13 V/M graph for the single track-based event detection system.
6.5.1 Results
The results were quite encouraging: we obtained 100% correct event detection results owing to
reasonable tracking performance. Some results are shown in Figure 6.14. In theory, one could also
design an event detection system that gives feedback to the trajectory variable node. However, we
assumed this feedback to be a uniform distribution in this example and did not use it in any
calculations.
Figure 6.14 Results of event detection based on analysis of a single trajectory. Each row represents one sequence. The result is shown as a text label in the last frame of each sequence.
6.6 Application: Event Detection Based on Multiple Targets
We also designed applications for event detection based on multiple trajectories. Specifically, we
designed applications to detect two people meeting in a cafe scenario, and piggybacking and tailgating
at secure doors. The event detection module worked according to simple rules based on the trajectories
of the targets. We show the V/M graph used for this application in Figure 6.15.
Figure 6.15 V/M graph for the multiple track-based event detection system.
The event detection module (F_F) in Figure 6.15 applies simple rules to the trajectories of two
targets to decide whether the event has taken place. Specifically, to detect two people meeting, it
checks that the trajectories of the two people converge and stay together for a while. To detect
piggybacking or tailgating, it checks whether the trajectories of the two targets started together, in
order to infer whether the person swiping the card was aware of the presence of the other person
behind him or her. If so, the event is piggybacking; otherwise, it is tailgating.
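The two rules can be sketched as follows. The distance and frame thresholds, and the use of the two tracks' start times as a proxy for the card swiper's awareness, are illustrative assumptions; the thesis states the rules only qualitatively.

```python
import numpy as np

# Illustrative sketch of the multi-track rules; thresholds are assumptions.

def detect_meeting(traj_a, traj_b, near=15.0, min_frames=10):
    """Two trajectories "meet" if they converge and stay within `near`
    pixels of each other for at least `min_frames` consecutive frames."""
    d = np.linalg.norm(np.asarray(traj_a, float) - np.asarray(traj_b, float), axis=1)
    run = best = 0
    for close in d < near:          # longest run of consecutive close frames
        run = run + 1 if close else 0
        best = max(best, run)
    return best >= min_frames

def classify_following(start_frame_a, start_frame_b, aware_gap=5):
    """If the second track starts within `aware_gap` frames of the first,
    the card swiper was presumably aware of the follower (piggybacking);
    otherwise the follower sneaked in behind (tailgating)."""
    return "piggybacking" if abs(start_frame_b - start_frame_a) <= aware_gap else "tailgating"
```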
6.6.1 Results
We implemented three different multitarget event detection systems, one for each type of event. Two
of these detected conditions at a secure door entry point into a building, namely tailgating and
piggybacking. The system picked up 80% of the instances of tailgating and piggybacking from a total of
5 examples in the video shot. Sample results are shown in Figures 6.16 and 6.17.
Figure 6.16 Sequence showing a detected “piggybacking” event. The first two images show representative frames of the second person following the first person closely, and the third image shows the detection result as an overlaid semitransparent letter “P.”
Figure 6.17 Sequence showing a detected “tailgating” event. The first two images show representative frames of the second person following the first person at a distance (sneaking in from behind), and the third image shows the detection result as an overlaid semitransparent letter “T.”
A sample result for the event detection system for the third type of event (“meeting for lunch”) is
shown in Figure 6.18. All the results are preliminary examples of the potential of the system and are
by no means indicative of how it compares to other event detection systems. The main difficulty in
comparing different event detection systems is the lack of a commonly agreed-upon video data corpus
for benchmarking different systems in the research community.
Figure 6.18 Sequence showing a detected “meeting for lunch” event. The first two images show representative frames of the second person following the first person to the lunch table, and the third image shows the detection result as an overlaid semitransparent letter “M.”
CHAPTER 7
CONCLUSIONS AND FUTURE WORK
Complex vision tasks such as event detection in a surveillance video can be divided into subtasks
such as human detection, tracking, recognition, and trajectory analysis. The video can be thought of
as being composed of various features. These features can be roughly arranged in a hierarchy from
low-level features to high-level features. Low-level features include edges and blobs, and high-level
features include objects and events. Loosely, the low-level feature extraction is based on signal/image
processing techniques, while the high-level feature extraction is based on machine learning techniques.
7.1 Conclusions
In this work we have shown that information extracted from an image or a video can be arranged in
a hierarchy of features. Toward the higher end of the hierarchy are semantically meaningful entities,
and toward the lower end are easy-to-extract features based on the local appearance of the image or
video. We characterized the differences between the nature of the two levels of this hierarchical
representation.
Many interesting tasks and applications such as object recognition, tracking, and event detection
require high-level feature extraction. Moreover, many of the high-level tasks are interrelated and can
benefit from each other through a feedback mechanism. Feedback from the high-level features to the
low-level features is also likely to help the processing at the lower levels, and in turn the higher
levels themselves. We presented our work on systems without feedback and showed the limitations of
such systems.
We also presented our work on a system with limited feedback, and commented on ad hoc methods
of designing systems with limited feedback. We presented examples of system design paradigms that
make extensive use of feedback and interaction between various units of a complex vision system.
Among those, probabilistic graphical models directly model the generative process of the observed
data. We commented on the limitations of these models.
We presented a new framework that makes use of the graphical structure of probabilistic graphical
models. This framework greatly simplifies modeling by replacing the function nodes of a factor graph
with generic modules inspired by feedforward modular systems. Intuition for the design, inference, and
learning methods of the framework was developed from the sum-product algorithm, products of experts,
the free-energy view of the EM algorithm, and local parameter optimization in factor graphs.
Applications were developed using the new framework, and even without very sophisticated individual
modules, the results were found to be encouraging.
7.2 Future Work
On the theoretical side, future work includes mathematical analysis of the conditions for convergence
of message passing in the new framework for graphs without loops. More analysis of adding uniform
distributions to the outputs of modules is also needed. One could also analyze the effect of changing
modules. At this point, our guess is that changing a few subconstraints will not change the solution
drastically, as long as the new submodule assigns high probability to the correct solution and does
not make the same mistakes (assigning high probability to the same non-solution regions) as the other
modules.
One thing that has not been touched upon in this work is learning the structure of the graph. Since we
fit predefined modules into a graph, we only learn the parameters of the modules. In applications with
dynamic graphs, such as when tracking a varying number of targets, we initialize and destroy modules
and variable nodes on the fly in an ad hoc fashion. Principled ways to change the structure of the
graph would be an interesting and useful direction to pursue.
On the application side, one could improve the multiperson tracking application with a more
sophisticated appearance model; one choice would be an online version of a manifold-learning-based
approach [73, 74]. One could also extend our preliminary work on trajectory modeling, for instance
using a better trajectory modeling approach such as one based on vector quantization [15]. Another
extension of this work would be an application that performs trajectory analysis to detect interesting
behavior patterns and unusual events in a surveillance scenario. The work could also be applied to
simple video stabilization applications.
Beyond the scope of surveillance problems, we think our work has a lot of potential. It lays down
design ideas for complex vision systems that can be applied to a vast number of vision applications,
and to other applications where modular learning systems can be useful. Other areas that could
potentially benefit from this work are speech processing, text processing, data mining, and multimodal
information fusion, such as in multimedia systems. In-depth theoretical analysis can provide further
insight into extending and improving the system.
REFERENCES
[1] S. Lehar, The World in Your Head. Mahwah, NJ: Lawrence Erlbaum Associates, 2003.
[2] B. J. Frey and N. Jojic, “Transformation-invariant clustering using the EM algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 1–17, Jan. 2003.
[3] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the International Conference on Computer Vision, vol. 2, IEEE Computer Society, 1999, pp. 1150–1157.
[4] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 509–522, 2002.
[5] L. Fei-Fei, R. Fergus, and P. Perona, “A Bayesian approach to unsupervised one-shot learning of object categories,” in Proceedings of the International Conference on Computer Vision, IEEE Computer Society, 2003, p. 1134.
[6] A. Sethi, D. Renaudie, D. Kriegman, and J. Ponce, “Curve and surface duals and the recognition of curved 3D objects from their silhouettes,” International Journal of Computer Vision, vol. 58, no. 1, pp. 73–86, 2004.
[7] Y. L. Kergosien, “La famille des projections orthogonales d’une surface et ses singularites,” C. R. Acad. Sc. Paris, vol. 292, pp. 929–932, 1981.
[8] J. J. Koenderink and A. J. Van Doorn, “The singularities of the visual mapping,” Biological Cybernetics, vol. 24, pp. 51–59, 1976.
[9] S. Lazebnik, A. Sethi, C. Schmid, D. J. Kriegman, J. Ponce, and M. Hebert, “On pencils of tangent planes and the recognition of smooth 3D shapes from silhouettes,” in Proceedings of the European Conference on Computer Vision, 2002, pp. 651–665.
[10] Y. Furukawa, A. Sethi, J. Ponce, and D. Kriegman, “Structure and motion from images of smooth textureless objects,” in Proceedings of the European Conference on Computer Vision, vol. 2, 2004, pp. 287–298.
[11] Y. Furukawa, A. Sethi, J. Ponce, and D. Kriegman, “Robust structure and motion from outlines of smooth curved surfaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 302–315, Feb. 2006.
[12] J. J. Weng, N. Ahuja, and T. S. Huang, “Learning recognition and segmentation using the cresceptron,” International Journal of Computer Vision, vol. 25, no. 2, pp. 109–143, 1997.
[13] F. Porikli and T. Haga, “Event detection by eigenvector decomposition using object and frame features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 7, IEEE Computer Society, 2004, p. 114.
[14] G. G. Medioni, I. Cohen, F. Bremond, S. Hongeng, and R. Nevatia, “Event detection and analysis from video streams,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 8, pp. 873–889, 2001.
[15] N. Johnson and D. Hogg, “Learning the distribution of object trajectories for event recognition,” in British Machine Vision Conference, vol. 2, 1995, pp. 583–592.
[16] C. Bregler, “Learning and recognizing human dynamics in video sequences,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 1997, p. 568.
[17] N. Nguyen, H. Bui, S. Venkatesh, and G. West, “Recognizing and monitoring high-level behaviors in complex spatial environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 2003, pp. 620–625.
[18] M. Han, A. Sethi, W. Hua, and Y. Gong, “A detection-based multiple object tracking method,” in Proceedings of the International Conference on Image Processing, 2004, pp. 3065–3068.
[19] C. Stauffer and W. Grimson, “Adaptive background mixture models for real-time tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 1999, pp. 246–252.
[20] F. Jurie and C. Schmid, “Scale-invariant shape features for recognition of object categories,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 2004, pp. 90–96.
[21] J. L. Chen and A. Kundu, “Rotation and gray scale transform invariant texture identification using wavelet decomposition and hidden Markov model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 2, pp. 208–214, 1994.
[22] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[23] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, “The FERET evaluation methodology for face-recognition algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1090–1104, 2000.
[24] D. Marr, Vision. San Francisco, CA: W. H. Freeman, 1982.
[25] A. S. Ogale and Y. Aloimonos, “Shape and the stereo correspondence problem,” International Journal of Computer Vision, vol. 65, no. 1, 2005.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.
[27] R. P. N. Rao and D. H. Ballard, “Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects,” Nature Neuroscience, vol. 2, no. 1, pp. 79–87, 1999.
[28] T. S. Lee and D. Mumford, “Hierarchical Bayesian inference in the visual cortex,” Journal of the Optical Society of America A, vol. 20, pp. 1434–1448, 2003.
[29] A. Srivastava, X. Liu, and U. Grenander, “Universal analytical forms for modeling image probabilities,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 9, pp. 1200–1214, 2002.
[30] E. Bienenstock, S. Geman, and D. Potter, “Compositionality, MDL priors, and object recognition,” Advances in Neural Information Processing Systems, vol. 9, pp. 838–844, 1997.
[31] D. Mumford, “Pattern theory: A unifying perspective,” in Perception as Bayesian Inference, D. Knill and W. Richards, Eds., New York, NY: Cambridge University Press, 1996, pp. 25–62.
[32] N. Jojic and B. Frey, “Learning flexible sprites in video layers,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, IEEE Computer Society, 2001, pp. 199–206.
[33] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, Special Issue on Codes on Graphs and Iterative Algorithms, vol. 47, pp. 498–519, Feb. 2001.
[34] G. Hinton, “Products of experts,” in Proceedings of the International Conference on Artificial Neural Networks, vol. 1, 1999, pp. 1–6.
[35] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
[36] N. Jojic, N. Petrovic, B. Frey, and T. Huang, “Transformed hidden Markov models: Estimating mixture models of images and inferring spatial transformations in video sequences,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, June 2000, pp. 2026–2033.
[37] C.-E. Guo, S.-C. Zhu, and Y. N. Wu, “Modeling visual patterns by integrating descriptive and generative methods,” International Journal of Computer Vision, vol. 53, no. 1, pp. 5–29, 2003.
[38] W. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. London: Chapman & Hall, 1996.
[39] Z. Tu, X. Chen, A. Yuille, and S.-C. Zhu, “Image parsing: Unifying segmentation, detection, and recognition,” in Proceedings of the International Conference on Computer Vision, vol. 1, 2003, pp. 18–25.
[40] H. Chen and S.-C. Zhu, “A generative model of human hair for hair sketching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 2005, pp. 74–81.
[41] J. Binder, D. Koller, S. Russell, and K. Kanazawa, “Adaptive probabilistic networks with hidden variables,” Machine Learning, vol. 29, no. 2–3, pp. 213–244, 1997.
[42] E. Bauer, D. Koller, and Y. Singer, “Update rules for parameter estimation in Bayesian networks,” in Proceedings of the Thirteenth International Conference on Uncertainty in Artificial Intelligence, 1997, pp. 3–13.
[43] P. Aarabi and S. Zaky, “Robust sound localization using multi-source audiovisual information fusion,” Information Fusion, vol. 3, no. 2, pp. 209–223, 2001.
[44] A. Blake and M. Isard, Active Contours. Secaucus, NJ: Springer, 1998.
[45] M. Brandstein and D. Ward, Microphone Arrays. Berlin, Germany: Springer, 2001.
[46] R. Cutler and L. Davis, “Look who’s talking: Speaker detection using audio and video correlation,” in Proceedings of the IEEE Conference on Multimedia and Expo, IEEE Computer Society, 2000, pp. 1589–1592.
[47] A. Garg, V. Pavlovic, and J. Rehg, “Audio visual speaker detection using dynamic Bayesian networks,” in Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition, IEEE Computer Society, 2000, pp. 384–389.
[48] J. Vermaak, M. Gangnet, A. Blake, and P. Perez, “Sequential Monte Carlo fusion of sound and vision for speaker tracking,” in Proceedings of the International Conference on Computer Vision, IEEE Computer Society, June 2000, pp. 741–746.
[49] Y. Rui and Y. Chen, “Better proposal distributions: Object tracking using unscented particle filter,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 2001, pp. 786–794.
[50] M. J. Beal, N. Jojic, and H. Attias, “Graphical model for audiovisual object tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 828–836, July 2003.
[51] C. H. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, pp. 320–327, Aug. 1976.
[52] T. Kailath, “The divergence and Bhattacharyya distance measures in signal selection,” IEEE Transactions on Communication Technology, vol. 15, pp. 52–60, 1967.
[53] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, June 2000, pp. 142–149.
[54] M. Rahurkar, A. Sethi, and T. Huang, “Robust speaker tracking by fusion of complementary features from audio and video modalities,” in Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), 2005.
[55] A. Sethi, M. Rahurkar, and T. S. Huang, “Variable module graphs: A framework for inference and learning in modular vision systems,” in Proceedings of the International Conference on Image Processing, vol. 2, 2005, pp. 1326–1329.
[56] N. Oliver, B. Rosario, and A. Pentland, “A Bayesian computer vision system for modeling human interactions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 831–843, 2000.
[57] C. Stauffer and W. E. L. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747–757, 2000.
[58] C. R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780–785, 1997.
[59] B. J. Frey and N. Jojic, “A comparison of algorithms for inference and learning in probabilistic graphical models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 9, pp. 1392–1416, 2005.
[60] J. Pearl, “Fusion, propagation, and structuring in belief networks,” Artificial Intelligence, vol. 29, no. 3, pp. 241–288, 1986.
[61] R. M. Neal and G. E. Hinton, “A new view of the EM algorithm that justifies incremental, sparse and other variants,” in Learning in Graphical Models, M. I. Jordan, Ed., Norwell, MA: Kluwer Academic Publishers, 1998, pp. 355–368.
[62] M. Isard and A. Blake, “Condensation: Conditional density propagation for visual tracking,” International Journal of Computer Vision, vol. 29, no. 1, pp. 5–28, 1998.
[63] R. J. McEliece, D. J. C. MacKay, and J. F. Cheng, “Turbo-decoding as an instance of Pearl’s ‘belief propagation’ algorithm,” IEEE Journal on Selected Areas in Communications, vol. 16, pp. 140–152, Feb. 1998.
[64] D. Heckerman, “A tutorial on learning with Bayesian networks,” in Learning in Graphical Models, M. I. Jordan, Ed., pp. 301–354, 1999.
[65] T. S. Jaakkola, “Variational methods for inference and estimation in graphical models,” Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, 1997.
[66] E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, IEEE Computer Society, 2003, pp. 605–612.
[67] D. Margaritis, “Learning Bayesian network model structure from data,” Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, 2003.
[68] R. T. Collins, A. J. Lipton, and T. Kanade, “Special section on video surveillance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 745–746, 2000.
[69] J. M. Ferryman, S. J. Maybank, and A. D. Worrall, “Visual surveillance for moving vehicles,” International Journal of Computer Vision, vol. 37, no. 2, pp. 187–197, 2000.
[70] R. Cucchiara, “Multimedia surveillance systems,” in Proceedings of the Third ACM International Workshop on Video Surveillance and Sensor Networks, 2005, pp. 3–10.
[71] Y. Li, L. Xu, J. Morphett, and R. Jacobs, “An integrated algorithm of incremental and robust PCA,” in Proceedings of the International Conference on Image Processing, vol. 1, 2003, pp. 245–248.
[72] J. Lim, D. Ross, R. Lin, and M. Yang, “Incremental learning for visual tracking,” Advances in Neural Information Processing Systems, pp. 793–800, 2005.
[73] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[74] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
AUTHOR’S BIOGRAPHY
Amit Sethi was born in Jalandhar (Punjab), India, in 1978. He received his bachelor's degree in
electrical engineering in 1999 from the Indian Institute of Technology, New Delhi, India, and his
master's degree in general engineering in 2001 from the University of Illinois at Urbana-Champaign, USA.
Since 2001 he has been a research assistant in the Department of Electrical and Computer Engineering
and at the Beckman Institute for Advanced Science and Technology. He was a visiting researcher
and a summer research intern at NEC Labs, Cupertino, CA, during the spring and summer of 2003,
respectively. His research interests include applications of graphical models and machine learning to
multimedia processing, image understanding and video understanding.