© 2006 by Amit Sethi. All rights reserved.
INTERACTION BETWEEN MODULES IN LEARNING SYSTEMS FOR VISION APPLICATIONS
BY
AMIT SETHI
B.Tech., Indian Institute of Technology, New Delhi, 1999
M.S., University of Illinois at Urbana-Champaign, 2001
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2006
Urbana, Illinois
ABSTRACT
Complex vision tasks such as event detection in a surveillance video can be divided into subtasks
such as human detection, tracking, recognition, and trajectory analysis. The video can be thought of
as being composed of various features. These features can be roughly arranged in a hierarchy from
low-level features to high-level features. Low-level features include edges and blobs, and high-level
features include objects and events. Loosely, the low-level feature extraction is based on signal/image
processing techniques, while the high-level feature extraction is based on machine learning techniques.
Traditionally, vision systems extract features in a feedforward manner on the hierarchy; that is,
certain modules extract low-level features and other modules make use of these low-level features to
extract high-level features. Along with others in the research community, we have worked on this design
approach. We briefly present our work on object recognition and multiperson tracking systems designed
with this approach and highlight its advantages and shortcomings. However, our focus is on system
design methods that allow tight feedback between the layers of the feature hierarchy, as well as among
the high-level modules themselves. We present previous research on systems with feedback and discuss
the strengths and limitations of these approaches. This analysis allows us to develop a new framework
for designing complex vision systems that allows tight feedback in a hierarchy of features and modules
that extract these features using a graphical representation. This new framework is based on factor
graphs. It relaxes some of the constraints of traditional factor graphs and replaces their function nodes with modified versions of modules that have been developed for specific vision tasks. Such modules can be formulated easily by slightly modifying modules developed for specific tasks in other vision systems, provided we can match their input and output variables to variables in our graphical structure. The framework also draws inspiration from products of experts and the free-energy view of the EM algorithm. We present
experimental results and discuss the path for future development.
To my parents
ACKNOWLEDGMENTS
I thank Professor Thomas S. Huang for the invaluable guidance, encouragement, and inspiration that
he has given me over the course of my studies. He has been helpful, understanding, and patient during
the tough times to bring this work to fruition. He knows how to provide a nurturing environment to his
students. I thank Professor David J. Kriegman at the University of California, San Diego, for introducing
me to computer vision and machine learning and for his support during the early part of my graduate
studies. I also thank the rest of my doctoral committee members, Professors Narendra Ahuja, Robert
M. Fossum, and Yi Ma, for their advice and support.
I thank my current and former colleagues for their camaraderie and research inputs, especially Man-
dar Rahurkar, Aleksandar Ivanovic, Dr. Nemanja Petrovic, Dr. Ashutosh Garg, Mithun Das Gupta,
Shyamsundar Rajaram, Cagrı K. Daglı, Jilin Tu, Yue Zhou, Maha El Choubassi, and Dennis Lin. I also
thank Professor Brendan J. Frey at the University of Toronto and Professor Lester Loschky at Kansas State
University for the fruitful collaboration with them.
I thank my friends from outside the research realm, especially Dr. Rajesh Kumar, Dr. Murali Manoj,
Zakia Khan, Rekha Santhanam, Soumya Jana, and Dr. Balakumar Balasubramaniam for being close and
supportive friends and for the numerous discussions toward finding meaning in life. I also thank Sachin
Sharma, Nitin Kumar, Srinivasan Rajagopal, Gaurav Gupta, Anurag Sethi, Sunita Singh, Parag Bhide,
Dr. Vaibhav Donde, Vijay Thakur, and Kirti Joshi for their support and encouragement.
Finally, I want to thank my parents Mrs. Vyjayanti Sethi and Col. Anand M. Sethi, and brothers
Anuj and Gautam for their love, support, and encouragement throughout my life, which helped me reach
where I am today.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
  1.1 Nature of Visual Data
    1.1.1 Constraints on pixels in visual data
    1.1.2 Hierarchical representation of variables from visual data
  1.2 Machine Learning and Video Understanding
  1.3 Original Contributions
  1.4 Overview

CHAPTER 2 MODULAR FEEDFORWARD VISION SYSTEMS
  2.1 Object Recognition
    2.1.1 Scenario
    2.1.2 Theory: feature mapping and matching
    2.1.3 Apparatus
    2.1.4 Object modeling
    2.1.5 Object recognition
    2.1.6 Results and discussion
  2.2 Multiple Object Tracking and Event Detection
    2.2.1 Human detection
    2.2.2 Multiple human tracking algorithm
    2.2.3 Tracking results
    2.2.4 Event detection
  2.3 Discussion

CHAPTER 3 VISION SYSTEMS WITH FEEDBACK AND GENERATIVE MODELS
  3.1 Related Work
    3.1.1 Connectionist models
    3.1.2 Information-theoretic models
    3.1.3 Generative models
      3.1.3.1 Probabilistic graphical models
      3.1.3.2 Models related to pattern theory and generative modeling
    3.1.4 Comparison of connectionist, information-theoretic, and generative models
  3.2 Application: Multimodal Person Tracking
    3.2.1 Algorithm
      3.2.1.1 Time delay of arrival estimation using audio signals
      3.2.1.2 Visual tracker
      3.2.1.3 Multimodal object tracking
    3.2.2 Results
  3.3 Discussion

CHAPTER 4 BACKGROUND FOR VARIABLE/MODULE GRAPHS
  4.1 Differences between Feedforward Modular and Generative Design
  4.2 A Unifying View
    4.2.1 Variables
    4.2.2 Constraints on variables
      4.2.2.1 Modeling constraints as probabilities
  4.3 Probability Density Modeling
    4.3.1 Mixture form
    4.3.2 Product form
    4.3.3 Probabilistic graphical models with product form
      4.3.3.1 Factor graphs
      4.3.3.2 Bayesian networks
    4.3.4 Discussion
      4.3.4.1 Advantages of product form
      4.3.4.2 Limitations of probabilistic graphical models

CHAPTER 5 VARIABLE/MODULE GRAPHS: FACTOR GRAPHS WITH MODULES
  5.1 Replacing Functions in Factor Graphs with Modules
  5.2 System Modeling using V/M Graphs and its Relation to the Product Form
  5.3 Inference
  5.4 Learning
  5.5 Free-Energy View of EM Algorithm and V/M Graphs
    5.5.1 Online and local E-step
    5.5.2 Online and local M-step
    5.5.3 PDF softening
  5.6 Prescription

CHAPTER 6 SOME APPLICATIONS OF V/M GRAPHS
  6.1 Scenario
  6.2 Application: Person Tracking
    6.2.1 Message passing and learning schedule
    6.2.2 Results
  6.3 Application: Multiperson Tracking
    6.3.1 Results
  6.4 Application: Trajectory Modeling and Prediction
    6.4.1 Trajectory modeling module
    6.4.2 Results
  6.5 Application: Event Detection Based on Single Target
    6.5.1 Results
  6.6 Application: Event Detection Based on Multiple Targets
    6.6.1 Results

CHAPTER 7 CONCLUSIONS AND FUTURE WORK
  7.1 Conclusions
  7.2 Future Work

REFERENCES

AUTHOR'S BIOGRAPHY
CHAPTER 1
INTRODUCTION
It is estimated that over 80% of all information around us is captured by our sight in the form of
visual data. Thus, it is not surprising that since the advent of photography, video capture, and television,
images and video have become increasingly important media for capturing, exchanging, and storing information. Today's digital technology and the affordability of image and video capture and storage devices have led to a visual information boom. Automatic methods of processing visual information
have become necessary to deal with this information explosion. Image processing, image compression,
video processing, video compression, image understanding, and computer vision have become impor-
tant research fields that are stepping up to the challenge of automatic processing and handling of visual
information.
There has been tremendous progress in research and development in the fields of image and video
compression, editing, and analysis software, leading to their effective usability and commercialization.
However, success in developing general methods of analyzing video in a wide range of scenarios remains
elusive. The main reason is the large number of scenario-dependent parameters that affect pixels within a video and across videos. Moreover, the sheer volume of raw data in video streams compounds the difficulty.
Yet the problem of image or video understanding is often ill-posed, making it even more difficult to solve from the given data alone. It is, therefore, important to understand the nature
of the generation of the visual data itself. It is also important to understand the features of visual data
that human users would be interested in, and how those features might be extracted. How these features
are related to each other and how the modules extracting them might interact with each other will be of
special interest in designing vision systems.
The human visual system is a ready example of a system that works successfully in extracting the
features of interest. Thus, studies in neuroscience and human visual perception have also deeply affected
the design of automatic computer vision and video understanding systems. However, the human brain and its visual pathway are still far from fully understood.
1.1 Nature of Visual Data
A digital video is a discrete version of a natural light signal captured by a camera. For a video of t frames, where each frame is m rows by n columns of pixels with three channels per pixel, the video lies in a raw pixel space of 3mnt dimensions. Assuming 8 bits per channel, there are 2^(24mnt) possible videos, or points, in this space. Not all points in this discrete space represent plausible, naturally generated videos. This means that actual videos lie on a low-dimensional subspace of the raw pixel space. Many constraints define this subspace.
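As a concrete illustration of this dimensionality argument, the following sketch (our own, not part of the dissertation) computes the dimension 3mnt of the raw pixel space and the exponent in the 2^(24mnt) count of representable videos:

```python
# Illustrative sketch: size of the raw pixel space of a digital video.
def raw_video_space(m, n, t, channels=3, bits_per_channel=8):
    """Return (dimensions, log2 of the number of representable videos)."""
    dims = channels * m * n * t            # 3mnt dimensions
    log2_points = bits_per_channel * dims  # 2^(24mnt) points at 8 bits/channel
    return dims, log2_points

# Even a 1-second, 30-frame, 64x64 clip is astronomically high-dimensional:
dims, log2_points = raw_video_space(64, 64, 30)
print(dims)         # 368640 dimensions
print(log2_points)  # i.e., 2^2949120 representable videos
```

The overwhelming majority of these points are noise; naturally generated videos occupy a vanishingly small subspace, which is what makes the constraints on visual data learnable.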
1.1.1 Constraints on pixels in visual data
Pixels in a frame represent light bouncing off the surfaces of various objects. Objects have finite dimensions and predictable appearance. This predictability is sometimes called spatial coherence in video. Moreover, due to the laws of physics, objects have limited accelerations and speeds. For example, if we know the world-record human speeds for various activities, we can expect the humans in a given video to be slower than those records. This predictability over time is known as temporal coherence in video.
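Temporal coherence can serve as a hard plausibility check on candidate interpretations of a video. The following hypothetical sketch (the speed bound, function name, and track format are our own assumptions, not the dissertation's) rejects a human track whose implied speed is physically implausible:

```python
# Illustrative: reject tracks whose frame-to-frame speed exceeds a bound.
MAX_HUMAN_SPEED_MPS = 12.4  # assumed bound, roughly a world-record sprint

def plausible_track(positions_m, fps):
    """positions_m: per-frame (x, y) world coordinates in meters."""
    dt = 1.0 / fps
    for (x0, y0), (x1, y1) in zip(positions_m, positions_m[1:]):
        speed = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / dt
        if speed > MAX_HUMAN_SPEED_MPS:
            return False  # temporally incoherent: no human moves this fast
    return True

print(plausible_track([(0, 0), (0.3, 0), (0.6, 0)], fps=30))  # True  (~9 m/s)
print(plausible_track([(0, 0), (5.0, 0)], fps=30))            # False (150 m/s)
```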
With an object- and event-centric view of visual signals, it is easy to describe how an image or a video was generated. Roughly speaking, objects are distinct contiguous entities in space that constitute an image or a frame. In a video, the state of these objects is usually a function of time. For example, the pose or position of an object might change with time. Video events are usually defined as functions of the states, or time series of states, of the objects.
Objects also have motion constraints that limit the shapes they can assume. For example, humans are constrained in how their joints bend. Depending on the context, even the types of objects (including humans) and events (including human activities) encountered are limited.
For example, in a scenario where a fixed camera is monitoring an indoor cafe, we do not expect to see
crocodiles or airplanes as objects, or volcanic eruptions as events. Thus, many constraints lead to predictability in videos, and this predictability can be learned and interpreted.
1.1.2 Hierarchical representation of variables from visual data
Humans naturally describe images and videos as compositions of objects and events, as opposed to
being a composition of pixels as they are often captured and stored in machines. Moreover, humans
would also describe certain features, such as individual pixel values and edges, as being caused by the objects and events present in the video. This reasoning about causality follows a certain path. For example, the presence of a foreground object against a background object causes the colors we see at certain pixel locations and the edges we see at the boundaries of the objects. This cause-and-effect relationship between features leads to a hierarchy in which the causes are usually the high-level entities. This interest in high-level features is a natural part of visual cognition and is necessary for our daily lives and survival. For example, we are far more interested in knowing whether we are in the path of a speeding car than in the dominant color of the scene.
Thus, some of the most challenging and interesting tasks that we want computer vision systems to
perform are detection and recognition of these high-level features such as objects, their states, and events
in images and videos. For example, in a surveillance video, we would like to detect any suspicious
activity by the humans under surveillance. Such suspicious activity can be termed an unusual event.
However, these high-level features are difficult to define mathematically and algorithmically. This also
makes their extraction more difficult. In many cases, we also want to extract these features in a manner that is invariant to many other factors. For example, if we want to determine whether the face of a particular person is present in an image, we want this determination to be invariant to lighting conditions, the location of the face, its pose, attire, facial hair, and so on. Due
to such challenges, face detection is still an active research problem. On the other hand, extracting a
corner, without any context, just depends on the local distribution of intensities, and can be defined
Table 1.1 Characteristics and examples of the low-level and high-level features that represent video data.

Low-Level Features                                    High-Level Features
Effect of high-level features                         Cause of low-level features
Semantically less meaningful                          Semantically more meaningful
Components of a high-dimensional representation       Components of a low-dimensional representation
Easy to define                                        Difficult to define
Easy to extract                                       Difficult to extract
Weaker requirement of invariant extraction            Stronger requirement of invariant extraction
Weaker notion of invariance                           Stronger notion of invariance
E.g., edges, color histogram, corners, optical flow   E.g., face (yes/no), walking
mathematically in a rather simple way (such as a high second eigenvalue of the correlation matrix of
image intensities in the neighborhood of the potential corner).
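The corner definition just given can be sketched directly: form the correlation matrix (structure tensor) of the intensity gradients over a neighborhood and test its second, i.e., smaller, eigenvalue. This is a minimal Shi-Tomasi-style illustration of our own; the patch size and test patterns are arbitrary choices, not the dissertation's:

```python
import numpy as np

def corner_response(patch):
    """Smallest eigenvalue of the structure tensor of a grayscale patch."""
    gy, gx = np.gradient(patch.astype(float))  # image gradients
    # 2x2 correlation matrix of gradients, summed over the neighborhood
    M = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(M)[0]  # eigenvalues in ascending order

flat = np.ones((7, 7))                           # no gradients at all
corner = np.zeros((7, 7)); corner[3:, 3:] = 1.0  # two perpendicular edges meet
print(corner_response(flat))                     # 0.0: not a corner
print(corner_response(corner) > 0.0)             # True: second eigenvalue positive
```

An edge alone leaves the second eigenvalue near zero; only two non-parallel gradient directions in the window push it up, which is exactly why the definition needs no context beyond the local intensity distribution.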
Table 1.1 summarizes the characteristics and examples of low-level and high-level features in video
data.
Much of the research in computer vision aims to bridge the semantic gap between the low-level representation of the data and the high-level semantic concepts (or features) that we are usually interested in. In a generative model of the image or video, variables that cannot be observed (usually, only the pixels can be observed) are known as hidden variables. Variables that are irrelevant to the interpretation of the desired concept, yet affect the observed data, are usually a subset of the hidden variables.
While deciding on the possible outcomes of the hidden variables that we are interested in, we need to be
able to take into account the possible values of the other hidden variables and their probabilities. Due
to the difficulty of obtaining reliable functions that extract high-level concepts in a manner that is invariant to the values of the often numerous irrelevant hidden variables, we need to rely on
adaptable functions that can make use of previous examples. This is where machine learning finds use
in computer vision applications.
1.2 Machine Learning and Video Understanding
Machine learning allows one to leave some function parameters to be estimated from the data itself
while estimating the function that maps points from one space to another (such as from the data space to
the feature space). When the estimation process makes use of labeled pairs of associated points from the
input and output spaces, it is known as supervised learning. The value in the output space is also called the label of the point in the input space. When the estimation process is required to find these labels on its own, as well as learn the function parameters, it is known as unsupervised learning, which includes data clustering (dividing data points into clusters based on some measure of closeness). Using some labeled points and some unlabeled points makes the technique semisupervised.
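The clustering just mentioned can be illustrated with a toy one-dimensional two-cluster k-means, in which the labels are invented by the algorithm rather than supplied (the data and initialization here are our own illustrative choices):

```python
import numpy as np

def kmeans_1d(xs, iters=10):
    """Unsupervised: assign each point to the nearer of two learned centers."""
    centers = np.array([min(xs), max(xs)], dtype=float)  # crude initialization
    labels = np.zeros(len(xs), dtype=int)
    for _ in range(iters):
        # assignment step: label each point with its nearest center
        labels = np.array([0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
                           for x in xs])
        # update step: move each center to the mean of its members
        for k in (0, 1):
            members = [x for x, l in zip(xs, labels) if l == k]
            if members:
                centers[k] = np.mean(members)
    return labels

print(kmeans_1d([0.1, 0.2, 0.15, 5.0, 5.2, 4.9]))  # [0 0 0 1 1 1]
```

A supervised learner would instead be handed the labels and only fit the mapping; here the grouping itself is the output, and its usefulness for a downstream vision task is exactly the hard-to-predict part noted below.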
There are two distinct schools of thought in machine learning, especially when applied to high-level
feature extraction in computer vision: discriminative and generative. In the discriminative approach, classification and recognition systems are built bottom-up, starting with low-level features in the hierarchy and extracting higher and higher levels of features or concepts. The design is concerned with accurate mapping between the input and output spaces, without any explicit emphasis on how a label from the output space could cause the generation of the data point in the input space. Usually, supervised methods are necessary to train discriminative approaches that provide a meaningful output for image and video processing, since it is difficult to predict how a given clustering criterion will affect the final outcome in a complex vision task.
On the other hand, it is hypothesized that humans infer the state of the world around them by
matching and validating the input signals against a model of the world that is already in their mind [1].
This means that humans tweak a model of how the data were generated, in a top-down fashion, to validate the observed data. Models that describe the process by which the data were generated belong to the second school of thought in machine learning and are known as generative models. Generative models naturally describe how high-level concepts give rise to the observed low-level data. Hidden variables can also be incorporated into a generative model in a principled manner. Probabilistic graphical models [2] are tools that help describe such models graphically and the inherent uncertainty within them probabilistically. The generative approach is also closely related to the human description of cause-and-effect relationships between various features of the data.
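As a minimal illustration of the generative viewpoint (entirely our own; the class names, priors, and Gaussian likelihoods are assumed for the example), a hidden cause H generates an observation X, and inference inverts this generative process with Bayes' rule instead of learning a direct input-to-output mapping:

```python
import math

priors = {"face": 0.5, "background": 0.5}   # p(H): assumed prior over causes
means  = {"face": 1.0, "background": -1.0}  # p(X | H): assumed Gaussian means

def posterior(x, sigma=1.0):
    """p(H | X = x): invert the generative model with Bayes' rule."""
    lik = {h: math.exp(-((x - m) ** 2) / (2 * sigma ** 2))
           for h, m in means.items()}
    z = sum(priors[h] * lik[h] for h in priors)  # p(X = x), the evidence
    return {h: priors[h] * lik[h] / z for h in priors}

p = posterior(0.8)
print(p["face"] > p["background"])  # True: 0.8 is nearer the "face" mean
```

The model is specified entirely in the causal direction, from hidden class to observation; recognition falls out as probabilistic inversion, which is what makes incorporating further hidden variables principled rather than ad hoc.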
Another related field that aims to simplify (or compress) the description of data and make its inter-
pretation robust to noise is information theory. There seems to be a direct intuitive relation between an
efficient representation (or compression) of the data and the description of the generative process. Thus,
we are also witnessing an increased application of information theoretic concepts to machine learning.
1.3 Original Contributions
In this work, we explore the two prevalent frameworks for designing computer vision systems.
We design and implement computer vision systems based on these two frameworks, study systems implemented by others, analyze their performance, and critically compare their advantages and disadvantages.
We also develop a new framework that combines the advantages of the two existing frameworks. We explore its theoretical underpinnings in some depth. We design
computer vision systems based on this new framework and analyze the results of our experiments. We
present suggestions for further exploration and improvement.
1.4 Overview
We study the frameworks to design complex vision systems to extract high-level features based on
machine learning. Scenarios of interest are object recognition, multitarget tracking, and event detection in surveillance videos. We start with the more prevalent approach to designing vision systems for these tasks, the feedforward design, and move toward the more intuitive generative approach, addressing some of the issues that generative approaches face. We then design a new framework that
combines the advantages of both of these approaches.
In Chapter 2, we present our work on the feedforward way of assembling feature extraction mod-
ules in a hierarchy to design a vision system for a given task. We present the results and discuss the
limitations of these methods. We also discuss why these approaches are so popular.
In Chapter 3, we present how some of the existing techniques address the issue of feedback among
various modules or units in a vision system. We present some of our work on ad hoc feedback design.
We discuss the strengths and limitations of these techniques.
In Chapter 4, we compare the two frameworks critically and present a view to see them from a
common viewpoint. We lay further foundation to seamlessly borrow ideas from either framework in
order to design a new hybrid framework called V/M graphs.
In Chapter 5 we develop the V/M graphs framework that can be viewed as a hybrid between the
discriminative approach and generative models. The framework uses the known modules that work
well for different vision tasks, which may have been designed using a discriminative approach. These
modules can be modified and fit into a complex generative model to simplify computation.
In Chapter 6, we present some applications and results using the V/M graphs framework. We highlight the qualities that the new framework borrows from the two existing frameworks.
We conclude our work in Chapter 7. We present the future work needed to establish whether the new framework can become a useful tool for designing complex vision systems, suggest directions for further developing this line of research, and note applications that could grow out of this work.
CHAPTER 2
MODULAR FEEDFORWARD VISION SYSTEMS
Complex tasks such as object recognition, multitarget tracking, and event detection have attracted
a lot of attention in the computer vision and video understanding community. Most of the systems for
these tasks work in a feedforward manner. Image processing and other techniques extract low-level
features, usually in the form of a feature map of the image or the frame. Based on these low-level
features, the object of interest is segmented. Attributes, which are again low-level or mid-level features, are extracted for the object. These features are compared to a model of the object for object recognition, or to a model of an event for event recognition. There are many techniques in object recognition and event
detection that fit this description. In this chapter we present a brief review and our work on vision
systems without feedback between modules.
2.1 Object Recognition
Object recognition is a high-level task. Most object recognition systems are based on the assumption
that the object can be accurately segmented from the image. Based on the segmentation results, certain
features are extracted. The extracted features are matched against a model of the object, which, in turn,
is based on these features. The models are usually formed in a supervised manner by presenting some exemplars of the object to a modeling system. Some examples from the vast number of systems that fit
this approach are [3–5].
2.1.1 Scenario
Based on the assumption that the object can be segmented from the image, in our work on object
recognition [6], we addressed the case where the only reliable information that can be extracted from
the image of the object is its silhouette. This is true when the object is bounded by a smooth textureless
surface. We further assumed the weak-perspective (or scaled orthographic) case where the rays from
the object are (approximately) parallel when they hit the image plane. The concept of the silhouette
formation from the occluding contour is depicted in Figure 2.1.
Figure 2.1 Occluding boundaries under orthographic projection.
2.1.2 Theory: feature mapping and matching
Every two-dimensional (2-D) point on a plane has an equivalent dual in the 2-D line space (since
both can be defined by two numbers). Extending this to a curve Γ in 2-D, its dual Γ′ can be defined as the locus of its tangents, where each tangent corresponds to a point on the curve Γ. We use this dual to
obtain an invariant representation of the silhouette that helps us in object recognition.
For each tangent orientation, there are two parallel tangents that enclose the silhouette between
them. These tangents represent the convex hull of the object. When the silhouette is complicated
enough, there are more than two parallel tangents for any given orientation of tangents. In general,
there is an even number of tangents of any given orientation unless the orientation represents one of
the special points. For example, if we rotate an object and densely sample the images for the changing
viewpoint/pose, we will see some of these tangents move closer together or farther apart. The special points represent singular viewpoints/poses at which the points of contact of two tangents merge, causing the two tangents to merge and then disappear if the viewpoint/pose continues to change in the same direction.
These singularities are similar to the concept of aspect graphs [7,8].
When the distance between parallel tangents is divided by the distance between the outermost
tangents for this given orientation, the resulting vector of normalized distances is a scale-invariant property
of the shape. The locus of these normalized intertangent distances over different orientations is called
the signature of the silhouette. The signature can thus be parameterized by the orientation angle of the
tangents, which spans a 180° space. The points on the signature corresponding to different angles do not
have the same dimension. The dimension is given by the number of parallel tangents for that orientation
minus 2 (the ones corresponding to the convex hull). The extraction of one data point for the signature
for a given orientation of the tangents is shown in Figure 2.2.
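The extraction of one signature sample can be sketched numerically. The sketch below is a simplified realization (not the implementation of [6]): it assumes the silhouette is given as a densely sampled closed curve, detects the tangent contacts of a given orientation as local extrema of the projection onto the tangent normal, and records the positions of the interior tangents normalized by the outermost-tangent distance.

```python
import numpy as np

def signature_point(curve, theta):
    """One signature sample: normalized positions of interior tangents of
    orientation theta. `curve` is an (N, 2) array of points sampled
    densely along a closed silhouette."""
    normal = np.array([-np.sin(theta), np.cos(theta)])
    proj = curve @ normal                       # signed distance of each point
    prev, nxt = np.roll(proj, 1), np.roll(proj, -1)
    # tangent lines of this orientation touch the curve at local extrema
    extrema = proj[(proj >= prev) & (proj >= nxt) |
                   (proj <= prev) & (proj <= nxt)]
    vals = np.sort(np.unique(extrema))
    span = vals[-1] - vals[0]                   # distance between outermost tangents
    # interior tangents only; scale cancels out after normalization
    return (vals[1:-1] - vals[0]) / span if span > 0 else np.array([])
```

For a convex shape there are no interior tangents and the sample is empty; scaling the curve leaves the sample unchanged, which is the invariance the signature relies on.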
Figure 2.2 The scalars d1 and d2 defining a point on the signature are determined using distances between parallel tangents on the original closed curve corresponding to the orientation (perpendicular to) line ∆θ at angle θ to a reference orientation (such as the x-axis). These distances are normalized by dividing them by the distance between the outermost tangents, d3.
The silhouette corresponds to an actual three-dimensional (3-D) curve on the surface of the object
where the view direction grazes the surface, as shown in Figure 2.1, known as the occluding contour.
Let us call this contour C1. The set of parallel tangent lines T1^θ1 grazing the silhouette S1 for a given
orientation θ1 in a given image I1 thus corresponds to a set of parallel tangent planes L grazing the
surface of the object and containing the view direction. The orientation of lines θ1 in the image I1 also
corresponds to an orientation about the view axis in the real world. Let the view direction (in the real
world) be V1. Let the set of points where the tangent planes touch the object be P, and the set of points
on the silhouette S1 in the image I1 where T1^θ1 touches the silhouette be P1.
Now, if we view the object from another direction V2 that lies on the same set of planes L (such
that V1 × V2 is normal to the planes), the corresponding occluding contour C2 will intersect the previous
occluding contour C1 at precisely the same points P where the planes graze the surface of the object.
This is trivially true, because the new view direction lies in the same set of planes. The set of points P
is known as the frontier points for the two view directions V1 and V2. Since the relative/normalized distance
between these planes remains unchanged (as these are the same planes for the two view directions or
the two images), the corresponding set of tangent lines T2^θ2 (for some orientation θ2 in the image I2) to
the silhouette S2 in the new image I2 (which are the images of the set of planes L) will have the same
relative separation as the lines in the set T1^θ1. This holds provided that the relevant frontier points in P
are not occluded in either I1 or I2. Thus, if we normalize the distances between tangents in the set T2^θ2
by dividing by the distance between the outermost tangents in this set, we will get the same signature
point as that of the set T1^θ1, which can thus be matched. Such a match is invariant
to image scaling and aspect-ratio changes between images, and depends on the geometry
of the object. If the set of parallel tangents is sufficiently rich (and the object geometrically complex),
such a matching can serve as surface-geometry-based object recognition. For more
details, see [6].
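Since the rotation between a test view and a model view is unknown, signature matching can search over cyclic shifts of the orientation parameter. The following is a simplified stand-in for the matching in [6]; representing a signature as a list of per-orientation vectors of normalized distances, and penalizing mismatched tangent counts with a fixed cost, are illustrative assumptions.

```python
import numpy as np

def match_score(sig_a, sig_b):
    """Compare two signatures sampled on the same orientation grid by
    searching over cyclic shifts (the unknown rotation between views).
    Each signature is a list of normalized-distance vectors, one per
    sampled orientation. Lower score means a better match."""
    n = len(sig_a)

    def dist(u, v):
        if len(u) != len(v):          # differing tangent counts: fixed penalty
            return 1.0
        if not len(u):                # both empty (convex orientation): match
            return 0.0
        return float(np.abs(np.asarray(u) - np.asarray(v)).mean())

    # best average distance over all cyclic alignments of the orientations
    return min(sum(dist(sig_a[i], sig_b[(i + k) % n]) for i in range(n)) / n
               for k in range(n))
```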
2.1.3 Apparatus
A camera with a zoom lens (high focal length) was used to take images of the object in a near
weak-perspective (scaled-orthographic) setting. The object was put on a pan-tilt turntable in front of a
back-lit screen or a dull black cloth, and a sequence of images in various poses (or equivalently, various
view directions) was taken. The high contrast between the opaque object and the bright back-lit
screen or the dull black cloth simplified the image processing required to extract the silhouette of the
object. The change in the pan-tilt parameters of the turntable determined the pose of the object, or
equivalently, the view angle. Some of the object modeling shots are shown in Figure 2.3.
Figure 2.3 Some images from the object modeling sequence of the ‘Phone’ object.
2.1.4 Object modeling
Six objects were modeled using the system. Representative images of these objects are shown in
Figure 2.4. These objects were put on the turntable and the table was rotated to simulate one great circle
of the view sphere.
Figure 2.4 Representative images of the six objects used for object modeling.
The Canny edge detector was used to obtain object boundaries as linked, closed curves. To prevent
the program from becoming confused by the internal edges while extracting the silhouette, some of
the internal edges were removed by hand. The silhouette was smoothed and its dual in tangent space
was calculated. While modeling the object from various images of the object, such signatures were
calculated and integrated into an object model. The complete system diagram for the object modeling
system is shown in Figure 2.5.
Figure 2.5 Object modeling system diagram.
2.1.5 Object recognition
Test images for recognition were shot from novel viewpoints that did not coincide with the great
circle that the modeling images were shot from. The silhouette from the test image was also extracted
and smoothed. The dual of the silhouette was then extracted. In the matching phase, the signature of the dual
is matched against the signature of the stored object models. Assuming orthographic projection, even
if the test image is taken from an angle not seen earlier, the signatures will match for the same object.
The reason for this expected match is that the silhouettes are orthographic projections of 3-D curves that
correspond to the grazing of the view direction of the actual object surface, as explained in Section 2.1.2.
The diagram for the system that matches the test images to stored models is shown in Figure 2.6. More
details are given in [6].
Figure 2.6 Object matching system diagram.
2.1.6 Results and discussion
The recognition system was tested on a set of six objects. The objects are shown in Figure 2.4. The
recognition system achieved more than 90% recognition rate for this set of six objects. The results are
shown in Figure 2.7. More experiments on cluttered background were also conducted and showed some
initial success [6]. These concepts were extended to a non-orthographic setting in [9]. Later, the same
principles were also used in calculating structure from motion for smooth textureless objects [10,11].
Figure 2.7 Some test images from novel (unseen) viewpoints that were correctly recognized. The contour is shown as a thick dotted line around the object.
The success of the modeling and recognition was found to be critically dependent on accurate ex-
traction of the object silhouettes. This required an elaborate setup of clutter-free background for taking
images. From a practical system point of view, a clutter-free background is rarely available in the real
world. One has to model the clutter and take noise into consideration for any practical application.
In fact, segmentation and recognition together can be viewed as chicken-and-egg problems [12], since
the success of recognition depends on accurate segmentation, and segmentation itself can benefit from
recognition. We explore systems with feedback for chicken-and-egg problems in Chapter 3.
2.2 Multiple Object Tracking and Event Detection
Similar to object recognition, most approaches in event detection have taken the feedforward
approach from low-level to high-level feature extraction, without much use of feedback in the system.
These approaches build object tracking on background subtraction and blob extraction modules,
whose outputs serve as inputs to the tracking modules. Some of the extracted features of the
objects, along with their positions, are used to form behavior models for event detection. Typical examples of
such systems are [13–17].
Our work on multiple object tracking and event detection is based on a similar approach [18]. The
objective is to track multiple humans in an indoor environment and detect whether certain events have
taken place. The human trajectories are used to trigger specific events. The multiple object tracking
method works with fixed cameras.
2.2.1 Human detection
Human detection starts with an adaptive background modeling module which deals with changing
illuminations and does not require objects to be constantly moving. A Gaussian-mixture-based
background modeling method [19] is used to generate a binary foreground mask image. An object-detection
module takes the foreground pixels generated by background modeling as input and outputs
object-detection probabilities. It searches over the foreground pixels and gives, for each
location, the probability that an object of a certain scale is present there. Any object-detection approach can fit into this part.
In our implementation, we apply a neural-network-based object-detection module to detect pedestrians.
Each foreground blob is potentially the image of a person. A neural network trained for this task is
applied at each pixel location. The network generates a score indicative of the probability that the blob
around the pixel does in fact represent a human at some scale. A
particular part of the detected person (e.g., the approximate center of the top of the head) is illustratively
used as the location of the object, which is shown as a light spot in Figure 2.8(c); lighter spots
indicate higher detection scores. The neural network searches over each pixel at a few scales.
The detection score corresponds to the best score, i.e., the largest detection probability, among all scales.
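The multi-scale scan described above can be sketched as follows. Here `score_fn` is a stand-in for the trained neural network, and the scales and window handling are illustrative assumptions; in the real system the network sees image patches, not the binary mask used in this sketch.

```python
import numpy as np

def detection_map(mask, score_fn, scales=(16, 24, 32)):
    """Scan every foreground pixel at several window scales and keep the
    best (largest) score across scales. `mask` is the binary foreground
    mask; `score_fn(patch)` stands in for the trained detector."""
    h, w = mask.shape
    out = np.zeros((h, w))
    ys, xs = np.nonzero(mask)            # only foreground pixels are scanned
    for y, x in zip(ys, xs):
        best = 0.0
        for s in scales:
            half = s // 2
            patch = mask[max(0, y - half):y + half, max(0, x - half):x + half]
            best = max(best, score_fn(patch))
        out[y, x] = best                 # best score among all scales
    return out
```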
Figure 2.8 Results of the human detection. (a) The original frame, (b) background mask, and (c) human detection probability map.
2.2.2 Multiple human tracking algorithm
The tracking algorithm accepts the probabilities of preliminary object-detection and keeps multiple
hypotheses of object trajectories in a graph structure, as shown in Figure 2.9. Each hypothesis consists
of the number of objects and their trajectories. The first step in tracking is to extend the graph to include
the most recent object-detection results, that is, to generate multiple hypotheses about the trajectories.
An image-based likelihood is then computed to give a probability to each hypothesis. This computation
is based on the object-detection probability, appearance similarity, trajectory smoothness, and image
foreground coverage and compactness. The probabilities are calculated based on a sequence of images;
therefore, they are temporally global representations of hypothesis likelihood. The hypotheses are
ranked by their probabilities and the unlikely hypotheses are pruned from the graph in the hypotheses-
management step. In this way a limited number of hypotheses are maintained in the graph structure,
which improves the computation efficiency.
In the graph structure (Figure 2.9), the graph nodes represent the object-detection results. Each node
is composed of the object-detection probability, object size or scale, location, and appearance. Each link
in the graph is computed based on position closeness, size similarity and appearance similarity between
two nodes (detected objects). The graph is extended horizontally in time. In this section we describe
three steps of the tracking algorithm: hypotheses generation, likelihood computation and hypotheses
management.
Figure 2.9 Multiobject tracking graph structure.
Given object-detection results in each image, the hypotheses generation step first calculates the
connections between the maintained graph nodes and the new nodes from the current image. The
maintained nodes include the ending nodes of all the trajectories in maintained hypotheses. They are not
necessarily from the previous image, since object-detection may miss detections. The connection
probability p_con is computed according to Equation (2.1):
p_con = w_a × p_a + w_p × p_p + w_s × p_s (2.1)
In Equation (2.1), w_a, w_p, and w_s are the weights in the connection probability computation; that is,
the connection probability is a weighted combination of the appearance similarity probability p_a, the position
closeness probability p_p, and the size similarity probability p_s. We prune the connections whose
probabilities are very low for the sake of computational efficiency. As shown in Figure 2.9, the generation
process takes care of object occlusion by track splitting and merging. When a person emerges from
occlusion, the occluding track splits into two tracks; conversely, when a person becomes occluded, the
corresponding node is connected (merged) with the occluding node. The generation process deals with
missing data naturally by skipping nodes in graph extensions; that is, a connection is not necessarily
built on two nodes from consecutive image frames. The generation handles false detections by keeping
some of the hypotheses that exclude the nodes corresponding to the false detections. It initializes
new trajectories for some nodes depending on their (weak) connections with existing nodes and their
locations (in areas where objects appear, such as doors and view boundaries). The multiple object tracking algorithm
keeps all possible hypotheses in the graph structure. At each local step, it extends and prunes the graph
in a balanced way to maintain the hypotheses as diversified as possible and delays the decision on the most
likely hypothesis to a later step.
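A minimal sketch of the connection computation of Equation (2.1) with pruning is given below. The specific similarity measures (histogram intersection for appearance, an exponential fall-off for position distance) and the weights and threshold are illustrative assumptions, not the exact choices used in the system.

```python
import numpy as np

def connection_probs(old_nodes, new_nodes, w=(0.4, 0.4, 0.2), thresh=0.1):
    """Weighted connection probability of Eq. (2.1),
    p_con = w_a*p_a + w_p*p_p + w_s*p_s, with weak links pruned.
    Each node is (appearance_histogram, position, size)."""
    wa, wp, ws = w
    links = {}
    for i, (ha, xa, sa) in enumerate(old_nodes):
        for j, (hb, xb, sb) in enumerate(new_nodes):
            pa = np.minimum(ha, hb).sum()                      # histogram overlap
            pp = np.exp(-np.linalg.norm(np.subtract(xa, xb)) / 50.0)
            ps = min(sa, sb) / max(sa, sb)                     # size similarity
            p = wa * pa + wp * pp + ws * ps
            if p >= thresh:                                    # prune weak links
                links[(i, j)] = p
    return links
```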
The likelihood or probability of each hypothesis generated in the first step is computed according to
the connection probability, the object-detection probability, trajectory analysis, and the image likelihood
computation. The hypothesis likelihood is accumulated over image sequences, and the likelihood for frame
i is incrementally calculated based on the likelihood in image i − 1, as shown in Equation (2.2):

l_i = l_{i-1} + \frac{1}{n}\sum_{j=1}^{n}\left[\log(p_{con_j}) + \log(p_{obj_j}) + \log(p_{trj_j})\right] + L_{img} \quad (2.2)
In Equation (2.2), l_i is the likelihood in the ith frame, and n represents the number of objects in the
current hypothesis. The term p_{con_j} denotes the connection probability of the jth trajectory computed in Equation
(2.1). If the jth trajectory has a missing detection in the current frame, a small probability, i.e., the missing
probability, is assigned to p_{con_j}. The term p_{obj_j} is the object-detection probability, and p_{trj_j} measures the
smoothness of the jth trajectory. We use the average likelihood of the multiple trajectories in the computation.
The metric prefers hypotheses with better human detections, stronger similarity measurements, and
smoother tracks. L_img is the image likelihood of the hypothesis. It is composed of two terms, as shown
in Equation (2.3):
L_img = l_cov + l_comp (2.3)
The term l_cov in Equation (2.3) is given in Equation (2.4), and l_comp in Equation (2.5):

l_{cov} = \log\left(\frac{\left|A \cap \left(\bigcup_{j=1}^{n} B_j\right)\right| + c}{|A| + c}\right) \quad (2.4)

l_{comp} = \log\left(\frac{\left|A \cap \left(\bigcup_{j=1}^{n} B_j\right)\right| + c}{\sum_{j=1}^{n} |B_j| + c}\right) \quad (2.5)
In Equation (2.4), l_cov calculates the hypothesis coverage of the foreground pixels, and in
Equation (2.5), l_comp measures the hypothesis compactness. A denotes the set of foreground
pixels, and B_j represents the pixels covered by the jth node (or track). The symbol ∩ denotes set
intersection and ∪ set union. The numerators in both l_cov and l_comp represent the foreground pixels
covered by the combination of the multiple trajectories in the current hypothesis. Therefore, l_cov represents the
foreground coverage of the hypothesis (the higher the value, the larger the coverage), and l_comp measures how much
the nodes overlap with each other (the larger the value, the less the overlap and the more compact the hypothesis). The term c is a constant.
These two values give a spatially global explanation of the image (foreground) information.
The hypothesis likelihood is a value refined over time. It provides a global description of object-
detection results. Generally speaking, the hypotheses with higher likelihood are composed of better
object-detections with good image explanation. It tolerates missing and false detections since it has a
global view of image sequences.
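The incremental update of Equation (2.2) and the image likelihood of Equations (2.3)-(2.5) can be sketched directly. Representing pixel sets as Python sets and the choice of the constant c are illustrative; the probabilities are assumed already computed by the earlier steps.

```python
import math

def update_likelihood(l_prev, p_con, p_obj, p_trj, l_img):
    """Eq. (2.2): average the per-trajectory log terms over the n
    trajectories and add the image likelihood L_img."""
    n = len(p_con)
    avg = sum(math.log(c) + math.log(o) + math.log(t)
              for c, o, t in zip(p_con, p_obj, p_trj)) / n
    return l_prev + avg + l_img

def image_likelihood(A, B_list, c=1.0):
    """L_img = l_cov + l_comp (Eqs. (2.3)-(2.5)). A is the set of
    foreground pixels; B_list holds the pixel set covered by each track."""
    union = set().union(*B_list)
    covered = len(A & union)                                   # |A ∩ (∪ B_j)|
    l_cov = math.log((covered + c) / (len(A) + c))             # coverage
    l_comp = math.log((covered + c) /
                      (sum(len(B) for B in B_list) + c))       # compactness
    return l_cov + l_comp
```

Overlapping tracks inflate the denominator of l_comp, so a hypothesis whose boxes overlap is penalized relative to a compact one covering the same foreground.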
This step ranks the hypotheses according to their likelihood values. To avoid combinatorial explo-
sion in graph extension, we only keep a limited number of hypotheses and prune the graph accordingly.
The hypotheses management step deletes out-of-date tracks, which correspond to objects that have
been gone for a while, and keeps a short list of active nodes, which are the ending nodes of the trajectories
of all the kept hypotheses. The number of active nodes determines the scale of graph extension;
therefore, a careful management step ensures efficient computation. The design of this multiple
object tracking algorithm follows two principles. (1) We keep as many hypotheses as possible and make
them as diversified as possible to cover all the possible explanations of the image sequences. The top
hypothesis is chosen at a later time to guarantee that it is an informed and global decision. (2) We make local
prunes of unlikely connections and keep only a limited number of hypotheses. With reasonable settings
of these thresholds, the method achieves real-time performance in a not-too-crowded environment.
The graph structure is applied to keep multiple hypotheses and make reasonable prunes for both reliable
performance and efficient computation. The tracking module provides feedback to the object-detection
module to improve the local detection performance. According to the trajectories in the top hypothesis,
the multiple object tracking module predicts the most likely locations to detect objects. This interaction
tightly integrates the object-detection and tracking, and makes both of them more reliable.
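The hypotheses-management step can be sketched as follows; the dictionary fields, the ranking key, and the age threshold are illustrative assumptions about how hypotheses and tracks are stored.

```python
def manage(hypotheses, frame, max_keep=50, max_age=30):
    """Rank hypotheses by likelihood, keep the top max_keep, drop
    out-of-date tracks, and collect the active ending nodes that the
    next graph extension will connect to."""
    kept = sorted(hypotheses, key=lambda h: h["likelihood"],
                  reverse=True)[:max_keep]          # prune unlikely hypotheses
    active = set()
    for h in kept:
        h["tracks"] = [t for t in h["tracks"]
                       if frame - t["last_frame"] <= max_age]  # out-of-date
        active.update(t["end_node"] for t in h["tracks"])
    return kept, active
```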
2.2.3 Tracking results
The multiple object tracking method has been tested on two cameras. The first scenario includes
two persons coming in through the door at about the same time. Figure 2.10(a) shows four images from the
sequence with overlaid bounding boxes showing the human detection results. The darker the bounding box,
the higher the detection probability. Figure 2.10(b) demonstrates the multiple tracks with the largest
probability generated by the multiple object tracking. The tracks are overlaid on the detection score map.
Different intensities represent different tracks. The human detection based on each image is certainly not
perfect. In the first and third images, the human detector misses the person in the back due to occlusion
and the person in the front due to distortion, respectively. There are false detections in the fourth image
caused by background noise and interaction between people. However, the multiple object tracking method
manages to maintain the right number of tracks and their configurations, as shown in Figure 2.10(b), because
it searches for the best explanation sequence of the observations over time. Figure 2.11 demonstrates
an example of multiple people tracking with crossing tracks. In this example, a lady first opens
the door for a person in a gray shirt; a person in a dark shirt then follows and enters the area.
Figure 2.11(a) shows the images from the sequence, and (b) demonstrates the tracking result. Interestingly,
there is one short track close to the upper-left corner of the result image, because one person is standing
inside the door and the human detector consistently detects him through the glass window. Therefore,
four tracks are shown in Figure 2.11(b): the short track for the standing person, the long track for the
lady, the light track for the person in the gray shirt, and the dark track for the person in the dark shirt.
However, there are cases (not displayed) where the human detection completely fails to pick a human
for a few consecutive frames. In such a scenario, as expected, the tracking and event detection that
depend on human detection results also fail. Also, the appearance matching is based on color-histogram
matching, which sometimes causes tracks to cross over in error.
Figure 2.10 Tracking results with missing/false human detections: (a) original images with overlaid bounding boxes showing the human detection results, (b) multiple object tracking result overlaid on the human detection map.
Figure 2.11 Tracking results of crossing tracks: (a) original images with overlaid bounding boxes showing the human detection results, (b) multiple object tracking result overlaid on the human detection map.
2.2.4 Event detection
Event detection was based simply on empirically determined, hard-coded rules against which the
multiple human tracks were tested for the occurrence of events of interest. The exact nature of the events is
not important for evaluating a simplistic event detection technique such as this, where no learning
is involved at the event detection level.
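A rule of this kind can be sketched as a simple trajectory test. The region-entry rule below is illustrative only, not one of the actual events used; trajectories are assumed to be lists of (x, y) positions.

```python
def entered_region(track, region):
    """Hard-coded rule sketch: an 'entry' event fires when a trajectory
    moves from outside a region of interest to inside it.
    `region` is (xmin, ymin, xmax, ymax)."""
    def inside(p):
        x, y = p
        xmin, ymin, xmax, ymax = region
        return xmin <= x <= xmax and ymin <= y <= ymax
    # look for an outside-to-inside transition between consecutive positions
    return any(not inside(a) and inside(b)
               for a, b in zip(track, track[1:]))
```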
2.3 Discussion
Several observations are in order. Feedforward approaches based on feature detection are the
popular choice in the research community for assembling complex vision systems. Attempts to
improve performance aim at improving the individual performance of the modules in the feedforward
hierarchy. A considerable amount of research has gone into designing features invariant to
transforms that are irrelevant to the task at hand [3, 20, 21]. Success largely depends on the
individual performance of modules in the hierarchy chain. There is information loss whenever a feature
extractor maps an image to a lower-dimensional space. It is not always obvious that the information
extracted by the feature detectors is all the information needed by the next level.
As discussed in the cases above, the feedforward systems depend on good performance of a few
critical modules. For example, the object recognition system in Section 2.1 critically depends on
successful extraction of the object silhouette, and the event detection system in Section 2.2 depends on the
human detection, color-histogram extraction, and interframe object matching criteria. Most interesting
real-world scenarios encountered are far too complicated, cluttered, and novel to depend on one or two
specialized modules. Moreover, there is no principled mechanism in feedforward and discriminative
design to make use of feedback from the high-level modules to help in learning or inference at the lower
levels or among the various high-level modules themselves. There is no use of synergetic inference and
learning between modules either.
In the area of multiple object tracking, there is no database against which the research community
can benchmark. Such databases exist for the face recognition community, such as the Yale Face
Database [22] and the FERET Face Database [23]. Moreover, most of the research is driven by target
applications and scenarios. Unlike well-segmented faces under different pose and lighting conditions, for
example, it is difficult to lay down specific scenarios for multiobject tracking. The nature of the
occlusion and changing lighting conditions is unique to every scenario.
For the task of event detection, specifying the scenario becomes even more difficult. It seems that
tight coupling between the various modules and levels of the feature hierarchy is the key to producing
more generalized systems. Chapter 3 deals with this issue further.
CHAPTER 3
VISION SYSTEMS WITH FEEDBACK AND GENERATIVE MODELS
Many subproblems tackled by a vision system are interrelated. For example, in a generic image,
segmentation and recognition of an object depend on each other, since the position, scale, pose, and
configuration of the object cannot be known beforehand. Vision systems that use correct detection and
segmentation as a precondition to recognition, pose estimation, etc., have to depend on the accuracy of
detection and segmentation results. On the other hand, a recognition system can provide feedback about
the accuracy of different segmentation results, thereby narrowing the search space of recognition based
on partial detection and segmentation. This can help detection and segmentation in certain cases of
ambiguity. Such ideas have been expressed before in [24, 25]. Similarly, in hierarchical representations
of visual data, feedback from higher levels can not only help inference in the face of ambiguity and
noise at lower levels (and in turn help refine the inference at higher levels), it can also help the learning
at lower levels tune feature detection parameters so that the results of higher-level processing can be
improved.
3.1 Related Work
Previous work on systems with explicit or implicit feedback between modules that handle different
tasks can be classified based on their design. The designs differ in the basic unit module that makes
up the system as well as in the way these modules interact with each other. While there has been work
done on interconnection of linear and simple nonlinear modules, we are not aware of any work on
interconnection of complex modules. Probabilistic graphical models use the interconnected graphical
structure, but they do not have processing modules per se.
3.1.1 Connectionist models
Connectionist models include neural-network-type models in which various vision tasks such as
feature detection, segmentation, and recognition are solved implicitly and jointly in a network of very
simple and often similar processing units. Inspired by an interconnection-of-neurons model of the human
brain, these models have no clear-cut modules that perform subtasks separately or explicitly. Examples include
convolutional neural networks [26], which perform digit localization and recognition.
Similarly, the Cresceptron is used for joint segmentation and recognition of objects from images [12].
The system is trained by presenting exemplars of segmented objects, which triggers the creation of a
connectionist network modeling the shape and appearance of the object. During the testing phase, the
network solves segmentation and recognition jointly.
There are clear advantages to using these methods, as shown by the results in the related
publications [12, 26]. Chief among them are competitive accuracy and the cosolving of two or more
interdependent tasks. However, it is usually computationally and memory intensive to train and process data
using these models. Moreover, it is not clear how to extend these models to more complicated tasks,
including tasks involving temporal sequences, such as tracking and event detection. Also, for most
tractable models, the feedback is limited to training and does not extend to inference.
3.1.2 Information-theoretic models
Information-theoretic models are based on principles of information compression and transmission,
that is, on information-theoretic concepts such as the mutual or relative entropy of different variables. In
layered information-theoretic models, higher-level layers predict the output of the lower-level layers,
and lower-level layers pass the prediction error back to the higher layer to help adjust its parameters
for prediction. Ideas that represent explicit hierarchical representations and feedback in the hierarchy
include [27] and [28].
In [27], an image is coded using a hierarchy of linear predictors in a cascade array, where the
receptive field of a linear predictor maintains spatial contiguity. A predictor at a given layer tries to
predict the output of the layer below, and the layer below passes only the error in prediction to the layer
above. Such a predictive mechanism is claimed to be a part of the human visual cortex. However, linear
prediction and the architecture presented in this work are restrictive, and their application to problems more
complex than learning local image structures is not clear.
The work presented in [28], however, takes a probabilistic approach within a predictive hierarchy
similar to that of [27]. The idea is to pass information down the hierarchy in the form of probabilistic
priors. It is related to an intuitively elegant theory, called pattern theory, of how visual patterns are generated
and how learning and inference for vision can be related to information theory. Some of the other
representative publications from the pattern theory group are [29–31]. However, the application of this
theory to more complex vision problems is yet to be seen.
Machine learning, when applied to vision, can be linked to information theory in the sense that the
higher-level features are a low-dimensional, compressed version of the low-level raw-pixel information
from which we are trying to draw inferences.
3.1.3 Generative models
A class of models entirely different from discriminative models are generative models, which
model the generative process, or the hypotheses, that gives rise to the observed data. The learning task
consists of estimating the free parameters of the generative model. The cost function optimized during
learning is usually based on maximizing the likelihood of the observed data. Since the observed data are
all that we are sure of, maximizing their probability (in the probability space they define)
makes sense if we want to make complete use of the observed data. A noise model is
often assumed in order to prevent overfitting of the model to the data.
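As a minimal illustration of likelihood-based learning in a generative model, consider fitting the mean of a Gaussian with known unit noise to observed data; maximizing the log-likelihood reduces to the sample mean. The data here are synthetic and purely illustrative.

```python
import numpy as np

# Toy generative model: data assumed drawn from a Gaussian with unknown
# mean mu and known unit noise. Maximizing sum_i log N(x_i; mu, 1)
# over mu yields the sample mean.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)  # synthetic observations
mu_ml = data.mean()                               # maximum-likelihood estimate
```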
The assumption behind these models is that we know (or at least have an idea of) the process that
generated the observed visual data. Feedback and interaction between various constituents is implicitly
designed in the generative process itself. For example, we know that certain scenes are generated due
to interaction of objects. For simplicity, this interaction can be modeled as an interaction between
layers [32]. The layers in front occlude the layers at the back. The features due to the appearance of the
individual layers and due to the interaction and occlusion between layers can be treated in a unified
manner if we try to generate the whole scene using the layered model. For this, we need to infer the
hidden variables and learn the parameters that define the layers and match against the whole observed
data to see if it can be generated with a high likelihood. As far as videos are concerned, usually the
difference between variables and parameters is that the parameters remain fixed (or change slowly) over
time, while variables change from one frame to another.
3.1.3.1 Probabilistic graphical models
The mathematical property of these models that makes them suitable for graphical representation
is the factorization property. The probability distribution defined over all the random variables in the model
rarely has all variables dependent on all the other variables. Most variables depend on only a small
set of other variables. Thus, a large number of pairs of variables in the model are mutually independent
given the other variables. This reduces the n-dimensional probability distribution function of n variables to
a product of a number of simpler factors, where each factor involves only a subset of the n variables.
A factor graph represents the factorization of a function graphically [33]. It is a bipartite graph,
which means it has two sets of nodes, with each node connected by edges only to some nodes in the
other set that it is not a part of. Thus, if the edges that connect nodes from one set to the other are
removed, the graph is left with no edges. The nodes in the first set are called the variable nodes; each
such node represents one of the variables of the overall function. The nodes in the other set are called
the function nodes; each function node represents one of the factor functions that must be multiplied to obtain
the overall (often probability) function of all the variables. A function node is connected to only those
variable nodes that represent the variables that are the arguments of the factor function represented by
the function node.
Other graphical models such as Bayesian networks and Markov random fields can be converted to
factor graphs. Hence, we shall deal with the most general of these models, the factor graph. One of the
main advantages of factor graphs is a local inference algorithm called the sum-product algorithm [33].
Pearl's belief propagation algorithm for Bayesian networks can be viewed as a special case of the sum-
product algorithm. Learning in factor graphs can be formulated in a number of ways.
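As an illustration of how the factorization enables local inference, the following sketch (not from the thesis; the factor potentials are arbitrary, illustrative values) runs the sum-product messages on a tiny chain-structured factor graph and checks the resulting marginal against brute-force enumeration over the full joint.

```python
# Toy chain factor graph: p(a, b, c) ∝ f1(a, b) * f2(b, c), each variable binary.
# The sum-product algorithm computes the marginal of b from local messages;
# we verify it against brute-force summation over the full joint.
import numpy as np

f1 = np.array([[0.9, 0.1],   # f1[a, b]
               [0.2, 0.8]])
f2 = np.array([[0.7, 0.3],   # f2[b, c]
               [0.4, 0.6]])

# Message from factor f1 to variable b: sum over a of f1(a, b).
msg_f1_to_b = f1.sum(axis=0)
# Message from factor f2 to variable b: sum over c of f2(b, c).
msg_f2_to_b = f2.sum(axis=1)

# Marginal of b is the normalized product of the incoming messages.
p_b = msg_f1_to_b * msg_f2_to_b
p_b /= p_b.sum()

# Brute force: build the full joint and sum over a and c.
joint = np.einsum('ab,bc->abc', f1, f2)
p_b_brute = joint.sum(axis=(0, 2))
p_b_brute /= p_b_brute.sum()

print(p_b, p_b_brute)  # the two marginals agree
```

The point of the sketch is that the sum-product result never touches the full joint table; each message involves only the variables local to one factor.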
Another related model that generalizes over factor graphs is the product of experts [34]. A useful
approximate learning algorithm based on contrastive divergence has been suggested for it [34]. However, a
local inference mechanism such as the sum-product algorithm has not been devised for it.
The research most relevant to the applications that we are interested in (tracking, separation
into layers) [2, 32] uses the EM algorithm [35]. Although an online version of the tracking algorithm was
also published [36], it is still based on very limiting assumptions about the change of appearance, and is
computationally very expensive. It is not clear how it could be used in a real-world online system that
processes a video in real time while producing useful tracking results for multiple targets.
3.1.3.2 Models related to pattern theory and generative modeling
Models related to pattern theory make use of information theoretic insight to come up with generative
models of the data. One of the successful applications of pattern theory and generative models
is in texture recognition [37]. Approximate inference techniques such as Markov chain Monte Carlo
(MCMC) [38] and the jump-diffusion process are used to infer the hidden variables in the generative model
to fit the observed data. Other successful applications of pattern theory and generative models have been
in image parsing [39] and human hair modeling [40].
However, these models suffer from drawbacks similar to those of other generative models (such as
probabilistic graphical models), namely slow inference and learning. This is of even greater concern in
the applications that have come out of pattern theory, since they use slow statistical sampling methods
such as MCMC. Thus, although the results show promising possibilities, the applications have been
limited to static image understanding tasks and have not scaled to video processing.
3.1.4 Comparison of connectionist, information-theoretic, and generative models
With a simple argument one can show that an accurate generative model is the lowest-dimensional
representation of the data. By an accurate generative model, we mean that the data were indeed generated as
described by the model's generative process, and that the assumption about the unpredictable noise
in the generative process is also accurate. Let us assume that a generative model with n independent
parameters was used to generate a data set. Any other learnt description of the data set with m independent
parameters will fail to account for the variation due to at least one of the parameters if m < n, due
to the independence assumption.
Generative models share certain characteristics with information theoretic models. First, while generative
models represent the data efficiently by using parameters and hidden variables of smaller
dimension than the observed variable space, information theoretic models equivalently
find an efficient representation or coding of the data through information theoretic means. Second, although
generative models are explicitly designed such that one can sample hidden variables from them
to generate a plausible observed data point, with some imagination and modification such sampling
and generation of plausible data points is also possible using information theoretic models, even
though they are not explicitly designed for such a sampling process.
On the other hand, while probabilistic graphical models and connectionist models share the graph
structure, the similarity ends there. Connectionist models have no explicit criteria that emphasize the
model's ability to generate (or sample) the observed data. However, there is some loose
connection between the two models when it comes to learning model parameters by making use of the
graphical structure of the model. Methods such as gradient backpropagation [26] in connectionist models
are similar to some of the gradient-based learning methods that piggyback on local message-passing
algorithms in probabilistic graphical models [41, 42].
3.2 Application: Multimodal Person Tracking
Feedback can also be built in an ad hoc fashion, leading to boosting-type methods. In surveillance,
target tracking and signal enhancement in sound are two of the important tasks. These two problems
can be solved jointly and synergistically. The spatial motion of a moving target can be followed using
video data captured by a camera. If the object emits sound (e.g., a person speaking), audio data can be
used to estimate the time delay of arrival of sound between two (or more) microphones and thus used for
tracking. Tracking using audio is robust to occlusions and variations in lighting, whereas tracking using
video alone gives us both x- and y-coordinates. This point is demonstrated in Figure 3.1, where the visual
modality loses track of the region of interest (ROI) due to occlusion. The occlusion is simulated as a
column of random noise pixels. Thus, it is intuitively obvious that these modalities should complement
each other, and when used together should provide a more robust system with collective capabilities that
are more than the sum of its parts. The audio and video signals are correlated at various levels; lip movement
of the speaker is correlated with the amplitude of part of the signal and can also help us narrow down the
ROI to the sound-generating source. Also, the time delay of arrival (TDOA) between the two microphones
is correlated with the position of the speaker in the image. It has been shown that humans use TDOA
as a main cue for source localization [43]. We also exploit TDOA for the audio-based estimate of the
person's position using two microphones. When visual tracking fails due to occlusion, instability of the
tracker, or corruption of frames by random noise, the audio modality can be used to reinitialize the visual
tracker. On the other hand, when visual tracking is robust, the estimate of the position of the object can
be used to get an estimate of the time delay of the component of the sound coming from the target at the
microphone pair, thus helping in source separation and noise cancelation.
Figure 3.1 Top row shows tracking performance using video alone in presence of occlusion (random noise strip). Note that in the rightmost frame, the tracker is unable to follow the subject. Bottom row shows tracking performance using both audio and video. Now the target is being followed after occlusion.
We consider a surveillance system with audio and video subsystems. The system blocks are shown in
Figure 3.2. These subsystems can be used together either by viewing the problem as a feature integration
problem, or the subsystems can interact among themselves to give a better solution than what either
one of them can generate individually. We demonstrate that by using audio alongside video our system
is robust to occlusion. It also performs robustly when some frames are totally corrupted by noise.
Audio is also used for automatic initialization of visual tracking. Results of the video tracking are used
to estimate the time delay for the audio signal generated by the target in a robust manner. This delay is
further used to separate the target audio signal from the background noise. To our knowledge, neither
problem has previously been solved using a multimodal approach.
Figure 3.2 System diagram.
There has been substantial work done in tracking moving objects using video [32, 36, 44]. Tracking
people using microphone arrays has also been done [45]. However, the problem of using these two
modalities together is relatively new and growing fast. In [46] the problem of speaker detection is
addressed by using a time-delayed neural network to learn the correlation between a single microphone channel
and an image in a video sequence. Cutler and Davis [46] and Garg et al. [47] also address a similar
problem by using multiple audio and video features, such as skin color, lip motion, etc., in a probabilistic
approach. The particle filtering approach of [44] was extended by Vermaak et al. [48] to include audio
for tracking by modeling cross-correlations between signals from the microphones in an array. The
tracking algorithm in [49] extends this approach using an unscented particle filter. Another approach
using graphical models was used by Beal et al. [50] for audiovisual object tracking. However, none
of these works deal with occlusion of the target or corruption of some frames by random noise. The
problem of audio source separation using visual tracking has also not been addressed.
3.2.1 Algorithm
We start by developing the audio and video subsystems of the surveillance system independently.
Then we combine the two subsystems to deal with the problems of visual tracking: initialization, occlusion,
and frames corrupted by noise. We also solve the problem of source separation and noise cancelation in
audio using the result of visual tracking.
We intentionally chose simple algorithms for each subsystem which, when combined across modalities,
do a better job than other sophisticated algorithms. We reiterate that the focus of this work
is on combining these modalities in a synergistic manner.
3.2.1.1 Time delay of arrival estimation using audio signals
As shown in Figure 3.3, let us assume that the target object T moves in the x-direction in a plane
parallel to the image plane O′I of an ideal pinhole camera with focal point O, where O′ is the location
where the optical axis meets the image plane, and I is the image of the target object. Thus, OO′
represents the optical axis and O′I represents the image plane. Let T′ be the projection of T on the
focal plane. Thus, z = TT′ is the distance of the plane of motion of the object from the focal plane
of the camera. Let the microphones M1 and M2 be placed at a distance d = OM1 = OM2 each on
either side of the focal point O of the camera. Let the distance of the object from the optical axis be
x = OT′. The angle φ1 = ∠TM1T′ follows the relation given in Equation (3.1) in triangle TT′M1.
tan φ1 = z / (x + d) = z / m(x)    (3.1)

where m(x) is a linear function of the position of the object in the x-direction in the image (x = O′I), which maps the image position to x + d = T′M1. Assuming φ1 ≈ φ2 = ∠TM2T′, we also get Equation (3.2):
cos φ1 = Δ / (2d)    (3.2)
Using Equations (3.1) and (3.2) we get Equation (3.3):

δ = Δ / c = (2d / c) cos(cot⁻¹(l(x)))    (3.3)
where l(x) is a linear function of x, δ is the time delay between the sound signals that arrive at the
microphones M1 and M2, Δ is the approximate extra distance that the sound has to travel to reach M1
when compared to M2, and c is the speed of sound.
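The geometry above can be checked numerically. The sketch below uses illustrative values for d, z, x, and the speed of sound (none are from the thesis); it computes the approximate delay of Equations (3.2) and (3.3) and compares it with the exact path difference between the two microphones.

```python
# Numerical check of the far-field approximation behind Equations (3.1)-(3.3).
# All values are illustrative, not from the thesis.
import math

c = 343.0   # speed of sound (m/s)
d = 0.05    # microphone offset from the focal point O (m)
z = 3.0     # distance of the plane of motion from the focal plane (m)
x = 1.0     # object offset from the optical axis, x = OT' (m)

# Equation (3.1): tan(phi1) = z / (x + d) in triangle T T' M1.
phi1 = math.atan2(z, x + d)

# Equations (3.2)-(3.3): approximate extra path length and time delay.
extra_path = 2 * d * math.cos(phi1)
delay = extra_path / c

# Exact path difference for comparison (M1 at -d, M2 at +d on the x-axis).
exact = (math.hypot(z, x + d) - math.hypot(z, x - d)) / c

print(delay, exact)  # the approximation is close for small d
```

For a microphone spacing that is small relative to the target range, the approximation error stays within a few percent, which is adequate since the calibration step absorbs systematic deviations.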
Figure 3.3 Geometry of time delay.
In the real world, however, some of the assumptions (such as the object moving exactly parallel
to the image plane) will not hold. Therefore, the mapping between x (the position of the
object in the x-direction in the image) and δ is estimated using a calibration process.
Our framework requires only 15 frames for learning the mapping from the time delay δ to the x-coordinate
of the object. This process is called calibration of the audio, and it is a one-time,
offline training process. After calibration, the audio subsystem can pass an estimate of the object
location to the video subsystem for every video frame.
The delay δ is estimated as follows. We consider windowed audio frames of N samples with 75%
overlap. It should be noted that several audio frames make up one video frame in terms of time. We
used cross-correlation as a coherence measure to determine the delay between the two microphones:

Rij(τ) = Σ_{n=0}^{N−1} xi[n] xj[n − τ]    (3.4)
where xi[n] is the discrete sampled signal received by microphone i, and τ is the TDOA between the two
received signals. In our case we had two microphones. The cross-correlation, and hence Rij(τ), is maximal
when τ equals the offset between the two signals. The complexity of computing
Rij(τ) using Equation (3.4) is O(N²). This can be approximated by computing the inverse Fourier
transform of the cross-spectrum as given by:

Rij(τ) ≈ Σ_{k=0}^{N−1} Xi(k) Xj(k)* e^{j2πkτ/N}    (3.5)
We have developed this algorithm closely along the lines of Knapp and Carter [51]. Also note that
δ follows the relation in Equation (3.6):

δ = τ / fs    (3.6)

where fs is the sampling rate of the sound signal. In practice, the calibration process fixes the mapping
between τ (which is estimated for each visual frame) and x.
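A minimal sketch of the delay estimation in Equations (3.4)–(3.6): the cross-correlation of the two microphone signals is computed through the cross-spectrum as in Equation (3.5), and the peak lag gives τ, which Equation (3.6) converts to seconds. The signal, delay, and frame length below are synthetic, for illustration only.

```python
# Estimate a known synthetic delay via the inverse FFT of the cross-spectrum.
import numpy as np

rng = np.random.default_rng(0)
fs = 44100          # sampling rate (Hz)
N = 2048            # samples per audio frame
true_tau = 25       # microphone 1 receives the signal 25 samples late

x2 = rng.standard_normal(N)   # signal at microphone 2
x1 = np.roll(x2, true_tau)    # circularly delayed copy at microphone 1

# Equation (3.5): cross-correlation via the inverse FFT of X1(k) X2(k)*.
R = np.fft.ifft(np.fft.fft(x1) * np.conj(np.fft.fft(x2))).real

# The TDOA estimate is the lag maximizing R, mapped to [-N/2, N/2).
tau_hat = int(np.argmax(R))
if tau_hat >= N // 2:
    tau_hat -= N

delta = tau_hat / fs          # Equation (3.6): delay in seconds
print(tau_hat, delta)
```

The FFT route reduces the O(N²) correlation of Equation (3.4) to O(N log N), which matters when many overlapping audio frames are processed per video frame.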
The mean of thex’s collected for all the audio frames corresponding to a video frame gives the
audio estimate of the position of the object inx-direction at timet: xaudiot . The inverse of the standard
deviation of thex’s collected for all the audio frames corresponding to a video frame gives the confi-
dence in audio-based estimate of the positionAudConf , which is used in combining the two modalities
to improve object tracking as described later.
3.2.1.2 Visual tracker
Once initialized by hand or automatically (by audio-based estimation, as shown later), the visual
tracker maximizes the Bhattacharyya coefficient [52] between the histogram of features extracted from
the target region in the previous frame and that extracted from the potential regions in the current frame.
The feature that we use is the grayscale intensity frequency distribution in the regions. Our visual tracker
is a simplified version of the mean shift tracker [53], which finds the rectangular window in the current frame
that is a translated version of the rectangular window from the previous frame. The matching finds an
exact solution by searching over one-pixel shifts in a region around the expected position of the window
in the current frame, and picking the window that gives the maximum Bhattacharyya coefficient with the
window from the previous frame. This is described in Equation (3.7):

x_t^video = arg max_{x ∈ N(x_{t−1})} Σ_{k=1}^{n} √( h_t^k(x) h_{t−1}^k(x_{t−1}) )    (3.7)
In Equation (3.7), x_t^video is the position of the window in frame t (the current frame), h_t^k(x) is
the histogram of the kth feature in the tth frame for the window around position x, and N(x_{t−1}) is the
neighborhood around x_{t−1}, defined as a rectangular region around x_{t−1} plus a fraction of the
previous motion vector if the previous motion vector is trustworthy, as determined by the maximum
Bhattacharyya coefficient. To our knowledge, such use of the Bhattacharyya coefficient to determine the
confidence in visual tracking has not been explored before. This gives a simple tracking strategy that
reduces the search space for the new window by predicting its position in the next frame.
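The window search of Equation (3.7) can be sketched as follows. The window size, search radius, histogram bin count, and synthetic frames are illustrative assumptions, not values from the thesis, and the motion-vector prediction is omitted for brevity.

```python
# Minimal Bhattacharyya-coefficient window search over one-pixel shifts.
import numpy as np

rng = np.random.default_rng(1)

def histogram(frame, pos, size=16, bins=16):
    """Normalized grayscale histogram of a size x size window at pos."""
    y, x = pos
    patch = frame[y:y + size, x:x + size]
    h, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return h / h.sum()

def bhattacharyya(h1, h2):
    return float(np.sum(np.sqrt(h1 * h2)))

def track(prev_frame, cur_frame, prev_pos, radius=5):
    """Window position maximizing the Bhattacharyya coefficient with
    the previous frame's target histogram (Equation (3.7))."""
    h_prev = histogram(prev_frame, prev_pos)
    best_pos, best_bc = prev_pos, -1.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            p = (prev_pos[0] + dy, prev_pos[1] + dx)
            bc = bhattacharyya(histogram(cur_frame, p), h_prev)
            if bc > best_bc:
                best_pos, best_bc = p, bc
    return best_pos, best_bc

# Synthetic test: a bright textured patch on a dark background,
# translated by (2, 3) pixels between frames.
frame1 = rng.integers(0, 40, size=(100, 100)).astype(float)
frame2 = rng.integers(0, 40, size=(100, 100)).astype(float)
patch = rng.integers(180, 255, size=(16, 16)).astype(float)
frame1[40:56, 40:56] = patch
frame2[42:58, 43:59] = patch

pos, bc_max = track(frame1, frame2, (40, 40))
print(pos, bc_max)  # expected near (42, 43) with bc_max close to 1
```

Note that bc_max doubles as a match-quality score, which is exactly what the failure test in the next subsection exploits.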
3.2.1.3 Multimodal object tracking
For initializing the visual tracker in the first frame, the position of the target as determined by the
audio subsystem is used. After this, the visual tracker performs two-frame tracking as described above.
For cases where visual tracking fails, a criterion to detect the failure was devised. In such
a scenario, the estimate of the position of the target determined by audio was used to reinitialize the
tracker. The estimate of the target position was also fed back to the audio subsystem to estimate the delay
associated with the sound component from the target, to perform noise cancelation.
Failure of the visual tracker was determined as follows. If the tracker loses the object due to drift or
occlusion, it settles on the background. When this happens, the tracking window stops moving and
settles on a constant window that does not change. This also results in the maximum Bhattacharyya
coefficient becoming close to one. Thus, when these two conditions occur simultaneously over consecutive
frames, it is an indication of tracker failure. When frames become totally corrupted, or when the
tracker suddenly loses track of the object, the maximum Bhattacharyya coefficient approaches zero.
This was the second criterion to indicate failure of visual tracking. The third way to determine failure was
the indication of a highly confident estimate of target position from the audio subsystem that was far
away from the visual tracking-based estimate. This is summarized in Equation (3.8):

VisFail = TRUE,  if bcMax ≥ θ1 AND |x_t^video − x_{t−1}| = 0;
          TRUE,  else if bcMax ≤ θ2;
          TRUE,  else if AudConf ≥ θ3;
          FALSE, otherwise.    (3.8)
where VisFail is the Boolean flag that tells whether visual tracking has failed, bcMax is the
maximum Bhattacharyya coefficient of matching the window from the previous frame with windows in the
current frame, x_t is the position of the window in the tth frame, AudConf is the confidence of the audio
subsystem in its prediction of the position of the target, and θ1, θ2, and θ3 are empirically determined
thresholds. The position x_t is set to x_t^audio (the estimate of the position of the target as determined
by the audio subsystem) when VisFail is TRUE, and to x_t^video when VisFail is FALSE, to give a
robust estimate of x_t.
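Equation (3.8) and the subsequent selection rule translate directly into code. The threshold values below are placeholders, since the thesis determines θ1, θ2, and θ3 empirically.

```python
# Failure test of Equation (3.8) and the audio/video fusion that follows it.
THETA1, THETA2, THETA3 = 0.95, 0.2, 0.8   # illustrative thresholds

def vis_fail(bc_max, x_video_t, x_prev, aud_conf):
    """Equation (3.8): has the visual tracker failed?"""
    if bc_max >= THETA1 and x_video_t == x_prev:
        return True          # window frozen on background
    if bc_max <= THETA2:
        return True          # corrupted frame or sudden loss of track
    if aud_conf >= THETA3:
        return True          # highly confident audio estimate overrides video
    return False

def fuse(bc_max, x_video_t, x_audio_t, x_prev, aud_conf):
    """Robust position estimate: audio if visual tracking failed."""
    if vis_fail(bc_max, x_video_t, x_prev, aud_conf):
        return x_audio_t
    return x_video_t

# Normal tracking: the video estimate is used.
print(fuse(0.7, 110, 95, 100, 0.3))
# Frozen window with a near-perfect match: audio takes over.
print(fuse(0.99, 100, 95, 100, 0.3))
```

The rule is deliberately cheap to evaluate per frame; all the cost sits in computing bcMax and AudConf, which the subsystems already produce.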
3.2.2 Results
We tested the algorithm on several test cases and found that multimodal object tracking improved
the performance over a purely visual tracker. In the cases where the visual tracker failed, the causes
included the tracker drifting onto a matching background, a change in the target's appearance, occlusion of
the target, and corruption of frames with random noise. We also tested the algorithm for noise cancelation
for source separation of the sound coming from the source of interest. Target motion was mostly
horizontal and translational without any significant movement in the y-direction. However, there was
significant change in the appearance of the object due to rotation and articulation. The algorithm was
consistently able to track the speaking target in the presence of background noise, occlusion, or even
when frames were corrupted with noise.
The video capture rate was 15 frames per second, while audio was digitized at 44,100 Hz. Thus,
we have 2940 audio samples for each video frame. In Figure 3.4, the horizontal direction represents the
time along the sequence.

Figure 3.4 The x-coordinate of the output of the audio, video, and combined trackers for occlusion.

Since the delay locations τ had to be mapped to image locations, 10 manually annotated
frames were used only once, and thereafter only raw data were given to the algorithm. No model
parameters were set by hand and no initialization was required.
We present the results on two sequences which had occlusion and/or dropped frames. Audio
waveforms were consistently corrupted with background noise. Occlusions were simulated by making
an occluding bar of pixels using random noise. When the visual tracker reaches the portion where occlusion
occurs, it loses the target and is unable to locate it again, as indicated by the constant estimate of the
x-coordinate. When we added the audio stream, which is not affected by the occlusion, the tracking
performance improved, as seen in Figure 3.4, where the tracker now uses the estimates given by the audio
stream and is thus able to locate the target.
Frame corruptions were simulated by replacing entire intermittent frames with random noise
pixels. The improvement in performance from having an additional audio modality is demonstrated in
Figures 3.5 and 3.6. In both these cases the visual tracker lost the target, but was able to follow the
target consistently with the estimates from audio. The work has been accepted for publication and will
appear as [54]. We are omitting the results on noise cancelation here.
Figure 3.5 Top row shows tracking performance using video alone when frames are dropped due to corruption by random noise. In the frames after the noisy frame, the tracker is unable to follow the subject. Bottom row shows tracking performance using both audio and video. Now the target is being followed even after frame drops due to noise.
Figure 3.6 The x-coordinate of the output of the audio, video, and combined trackers for noisy frames.
3.3 Discussion
In the human tracking system developed with an ad hoc feedback mechanism between the subsystems
dealing with the audio and video modalities, we saw a definitive improvement in results. A
more principled way to integrate different modalities and build a tighter feedback mechanism into the
system would be to use a generative model. The performance of a completely generative model is limited
by the designer's understanding of the generative process and the ability to convert that understanding
into a generative model: the functions of a factor graph, for example. Moreover, for complex systems it
may be very tricky to formulate algorithms such as the sum-product algorithm for inference and
the EM algorithm for learning. Approximations may be needed for the marginalization and other integration
procedures involved in the sum-product algorithm and the EM algorithm. On the other hand,
some of the specialized modules developed for feedforward systems perform this marginalization for
their output variables in some sense. With some modification, these can inspire new forms of function
approximations for factor graphs. These ideas are explored further in Chapter 4 to inspire the design of
the new framework described in Chapter 5 for designing complex vision systems.
CHAPTER 4
BACKGROUND FOR VARIABLE/MODULE GRAPHS
In Chapters 2 and 3, we saw that there are two types of system design approaches used in the research
community for solving complex vision problems. In the more widely used approach of feedforward
modular design, systems are designed as an interconnection of modules with one-way flow of information
from one module to another. The desired high-level task is done by a module at the end of the
processing chain. In the second approach, there is implicit or explicit feedback between various modules
to solve related tasks, treating them as interrelated subproblems. The mathematical theory for the generative
modeling approach is better understood than that for ad hoc feedback approaches and
connectionist models, since generative models model the probability distribution function (PDF) of
the combined set of observed and hidden variables. In this chapter, we develop a hybrid approach, called
variable/module graphs (or V/M graphs) [55], by combining the two approaches in order to benefit from
the advantages of both.
4.1 Differences between Feedforward Modular and Generative Design
Design of a feedforward modular vision system [6, 18, 56–58] usually follows a familiar path. One
starts by identifying the end variables (or high-level features) that need to be estimated for each observed
data point. The image can be treated as a high-dimensional vector, which serves as the data point. For
a video, images or frames form a temporal sequence. A video clip can also be treated as a single
data point by concatenating its frames. Then one identifies some intermediate variables (or low-level
features) that can be extracted and can possibly help in the detection of the desired high-level features.
If the low-level features can be extracted satisfactorily according to the context and the scenario, it
becomes easier to extract the high-level features. However, many of the low-level features are very
difficult to detect due to phenomena such as occlusion, lighting, and shadows. Thus, one tries to design
modules that take the observed data (raw pixels) as input and produce low-level features as output in a
manner that is robust to some of these phenomena. Similarly, one designs modules that take low-level
features as inputs and produce high-level features as outputs as robustly as possible. There may be other
tiers in between, depending on the designer's imagination and the needs of the task at hand.
On the other hand, while designing a generative model for video processing [32, 36, 37, 39], one
hypothesizes a generative process of how the high-level variables lead to the generation of the observed
variables. Any intermediate or end variable that is not observed directly is called a hidden variable. The
emphasis of the design is not on individual modules or feature extraction but on the generative process as a
whole. The interaction between variables is coded in the form of the conditional independence of each variable
with respect to other variables. When a subset of variables are directly dependent on each other, this
relation is coded as a joint density function of those variables.
The differences between the two approaches are tabulated in Table 4.1.
4.2 A Unifying View
With some insight and imagination, one can see that there are common points between the two ap-
proaches. First, one needs to identify the hidden variables associated with the observed variables. Some
of these hidden variables form the end variables, whose values we are interested in computing. If we
consider a joint space spanned by observed and hidden variables, it is obvious that not all combinations
of different values (or not all points in this joint space) are equally likely. For example, if the position
of a person in a frame is such that it covers the pixel at position(x, y), then the intensity and color
of the pixel is more likely to be drawn from the appearance of the person than from that of the back-
ground. Such quantification can be coded as a hard or a soft constraint between the values of different
variables. Although it is done differently in the two approaches, identification of hidden variables and
quantification of constraints between variables is the two points that are common between them.
Table 4.1 Differences between modular and generative design.

Design
  Modular: Identify end-variables. Identify features that can or need to be extracted. Identify features of features.
  Generative: Hypothesize a generative process and dependence structure. Graphically represent the structure. Approximate inference and learning.

Diagnostics
  Modular: Improve individual modules. Replace modules with better ones.
  Generative: Review the generative process and dependence structure. Review approximations.

Advantages
  Modular: Straightforward, discriminative design. Bottom-up, speedy inference. Fewer parameters; faster learning. Complex joint-PDF models for specific features.
  Generative: Natural and intuitive representation of the generative anatomy. Makes principled and complete use of the information available. Local message passing for inference. EM algorithm for holistic learning.

Shortcomings
  Modular: May not use all information because of feature extraction. No feedback among modules. Depends on the individual performance of modules. May not be representative of the generative anatomy.
  Generative: Limited types of PDFs can be modeled. Even approximations are slow. Online learning not always easy. Local learning not always easy.
4.2.1 Variables
The inference that we need to make from a video is usually posed as a variable estimation problem.
Since the variable representing the inference is not directly observed, it is called a hidden variable in
the terminology of probabilistic graphical models [59]. There are other hidden variables in the system
that help simplify the relation between the observed variables and the hidden variable that needs to be
inferred. For example, if we want to infer whether a person entered or exited the scene, the pixels of the
different frames of the video form the observed variables, and the variable representing the “enter” or “exit”
event is the hidden variable to be inferred. In this case, the variable representing the position of
the person in a given frame is an example of a hidden variable that we do not really care about
by itself. However, this variable helps define the relation between the pixels and the “enter/exit” event
variable in a more structured manner. The estimation of the event variable only needs to be based on the
temporal sequence of the position variable, while the position variable can be directly related to the
observed pixel variables. In other words, the event variable can be conditionally independent of the
pixel variables, given the position variable.
In probabilistic graphical models (or generative models), these variables form nodes of the graph.
On the other hand, in modular design, the output of modules can be thought of as the hidden variables.
This can help us think of both the design paradigms from a common viewpoint, which will help us
develop the V/M graph framework to design vision systems.
Generative models, especially those using probabilistic graphical models, work with probability
distributions of the variables instead of their single values. On the other hand, in modular systems,
modules usually output single values of various variables. One can think of these single values as
Dirac-delta probability distributions or probability mass functions of these variables. This will further
help us put the ideas from both design paradigms together for the V/M graph framework.
4.2.2 Constraints on variables
Although the ways in which constraints on variables are formulated in the two paradigms seem
very different, they can be viewed from a common viewpoint. As we shall see, the overall knowledge of
these constraints is usually made up of a collection of subconstraints, where each subconstraint
is defined only on a subset of the (observed and/or hidden) variables.
In modular design, the working of a processing module defines the output for a given input; in
other words, it constrains the output variable based on the input variables. This can be thought of as a
constraint on the joint value of the input and output variables. Another point worth noting is that a module
usually puts constraints on only a subset of the variables, where the subset is defined by the variables
forming the input and output of the module.
On the other hand, constraints in probabilistic graphical models are defined as joint probability
density functions between different variables. The graphical structure takes advantage of the fact that
the net joint density over all the variables can often be expressed as a product of different joint
density functions, where each of these function terms in the product takes only a subset of the variables
as its arguments.
Thus, in both the paradigms, the net constraint on the joint space of observed and hidden variables
is expressed as a collection of subconstraints, where each of these subconstraints is defined only on a
subset of all the variables. This gives us the common ground to view the expression of constraints in
both paradigms from the same viewpoint after viewing the variables in the same setting in Section 4.2.1.
4.2.2.1 Modeling constraints as probabilities
Building upon the idea of constraints on variables, one can think of the subconstraints put by a
module as a joint probability distribution between the input and the output variables. An approximation to
the actual distribution can theoretically be made by extensive sampling of inputs and their corresponding
output values from the module. Each of the modules puts one or more of these subconstraints on
the distribution of these variables. The coexistence of these subconstraints occurs in an AND fashion,
making it similar to the product form used in probabilistic graphical models.
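The sampling idea above can be sketched as follows. A hypothetical thresholding module (an illustrative stand-in, not a module from the thesis) is sampled repeatedly to build an empirical joint histogram over its binned input and its output; the histogram then acts as a soft constraint that puts no mass on input/output pairs the module never produces.

```python
# Approximating a module's input/output constraint by an empirical joint
# distribution, built from extensive sampling.
import numpy as np

rng = np.random.default_rng(2)

def module(x):
    """A hypothetical thresholding module: input in [0, 1) -> binary output."""
    return int(x > 0.5)

# Sample inputs, run the module, and histogram the (input bin, output) pairs.
n_bins = 10
joint = np.zeros((n_bins, 2))
for _ in range(10000):
    x = rng.random()
    y = module(x)
    joint[int(x * n_bins), y] += 1
joint /= joint.sum()

# The empirical joint encodes the module's constraint: pairs the module
# never produces receive (almost) no probability mass.
print(joint[0])   # low inputs: mass concentrated on output 0
print(joint[9])   # high inputs: mass concentrated on output 1
```

In a product-form model, such a table would enter as one factor among many, combined multiplicatively (in an AND fashion) with the other subconstraints.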
Thus, our learning goal would be to estimate the parameters of this joint distribution that models the
probability of the observed data points by optimizing some cost function. The most common cost function
used is the likelihood of the data itself, and the estimation process is known as maximum likelihood
estimation (ML estimation).
The goal of the inference process is to estimate the (statistics/distribution of the) hidden variables of interest, given the parameters and the observed variables. Thus, we need to know the parameters to infer the hidden variables. On the other hand, we can estimate the parameters easily if we know the hidden variables associated with the data (observed variables). In other words, the inference problem is the dual of the learning problem, and their solutions are mutually dependent. The EM algorithm [35] iterates over two steps, the E step and the M step, refining the estimates of the hidden variables and of the parameters in turn.
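This alternation can be sketched concretely. The following is a minimal Python illustration (not from the original text) of EM for a toy one-dimensional, two-component Gaussian mixture with unit variances and equal priors: the E step computes responsibilities (soft estimates of the hidden component labels), and the M step re-estimates the means from those responsibilities.

```python
import math
import random

def em_1d_gmm(data, mu, iters=50):
    """Toy EM for a 1-D two-component Gaussian mixture with unit variances
    and equal priors: alternate responsibility (E) and mean (M) updates."""
    mu = list(mu)
    for _ in range(iters):
        # E step: posterior probability that each point came from component k.
        resp = []
        for x in data:
            w = [math.exp(-0.5 * (x - m) ** 2) for m in mu]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M step: update each mean as the responsibility-weighted average.
        for k in range(2):
            num = sum(r[k] * x for r, x in zip(resp, data))
            den = sum(r[k] for r in resp)
            mu[k] = num / den
    return mu

random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(200)] + \
       [random.gauss(3.0, 1.0) for _ in range(200)]
means = sorted(em_1d_gmm(data, mu=[-1.0, 1.0]))
print(means)  # converges near the true means, -2 and 3
```

The mutual dependence described above is visible in the code: the E step needs the current means, and the M step needs the current responsibilities.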
In V/M graphs, we will use the graphical structure of probabilistic graphical models to model the structure of dependencies and constraints on the variables, while we will use the processing of modules as approximations to some of the functions for parameter learning and for marginalization over probability distributions during inference, in order to speed up the process. An added advantage is that we can naturally use the more complex and task-specific constraints that some modules can define.
4.3 Probability Density Modeling
In the previous section, we established how variables, and constraints on variables, in modular and generative design can be viewed from the same perspective. This requires modeling constraints on variables as probability density functions. The advantages of doing so include the following:
1. Probabilistic modeling provides a way to deal with uncertainties. Video processing is notoriously afflicted by uncertain data. For example, edges become weak in low light or under saturation, or disappear behind occluding surfaces. Inferences cannot be made without taking different sources of information into account. In such a scenario, probabilistic modeling provides a way to remain noncommittal, postponing a decision until more information can be incorporated.
2. Probability density functions can be broken into simpler subfunctions using different density modeling techniques. Such a property is useful when the net constraint on the variables can be expressed as a combination of subconstraints, as is often the case.
3. Bayesian modeling provides a mathematically sound feedback mechanism. While dealing with variables at different levels, feedback from higher-level variables can help disambiguate the inference at lower levels, since higher-level variables are usually inferred using more information than a lower-level variable. Bayesian modeling provides us with the mathematical tools that allow bidirectional flow of information. In a loose sense, by "lower-level" variables we mean the variables that are closely related to the observed variables, or raw data, in a graphical structure of conditional dependencies.
4. It gives access to well-understood algorithmic tools. Message-passing algorithms for inference, such as the sum-product algorithm [33] and belief propagation [60], are available for probabilistic models. Similarly, well-understood learning algorithms such as the EM algorithm [35] and its variants [61] are available for probabilistic models.
While modeling a vision system, generative modeling provides a principled way to model the data under probability theory. A generative model is usually formulated to explain the entire data in the form of a cost function that represents the data likelihood. Using such cost functions in a generative setting allows us to account for all the information present in the data, so we do not have to worry about the information loss usually associated with feature extraction modules. Any information loss can instead be quantified in the assumptions made on the various probability functions. However, an accurate model of the entire data in interesting video-understanding scenarios is usually mathematically and computationally prohibitive. Thus, it becomes necessary to make approximations during modeling, inference, and learning. For example, when a camera is fixed, it is reasonable to assume that changes in human appearance will be due to changes in joint angles; it is prohibitive, however, to model all the joints of the human body and match their angles against the appearance. Approximate models can instead represent the changing human appearance as a linear combination of eigen-bases of appearance maps, or model only a subset of (salient) joint angles of the human body. For many practical applications, such modeling approximations produce acceptable results. However, these simpler models may be inaccurate and incomplete for more difficult high-level tasks such as multi-object tracking with partial or brief occlusions.
On the other hand, researchers using modular approaches have already improved the performance of individual modules in isolation. For example, background subtraction [19, 56] and contour tracking in simple scenarios using Kalman filters [58] or particle filters [62] have reached a certain level of maturity. These modules are ready sources of inspiration, and of functional approximations, for modeling some of the constraints that can be placed on the joint-probability space of observed and hidden variables. So, while the intuition of mutual conditional independence can be coded in graphical form, as done in probabilistic graphical models, some of the constraints between mutually dependent variables can be formulated using the modules employed in non-generative modular vision systems. This, as we shall see in Chapter 5, is the key idea behind V/M graphs.
The goal of a generative model is to fit a plausible probability model over the joint space of ob-
served and hidden variables such that it best explains the observed data. Modules that in effect constrain
the joint-probability space can be combined in different ways as subconstraints on that space. The two most well-studied ways to combine simpler functions or subconstraints to model complex probability densities are the mixture form and the product form. In the following, we investigate where and why the product form is better suited than the mixture form for combining subconstraints.
4.3.1 Mixture form
In the mixture form, a PDF is modeled by a weighted sum of different functions. These functions
are commonly known as mixture components and the weights are known as mixing coefficients. The
mixing coefficients are determined by the fraction of the total density represented by the corresponding component. As an example, the density p(x) can be represented as a sum of n mixture densities p_i(x) (where i ranges from 1 to n) with mixing coefficients ω_i, as shown in Equation (4.1):
p(x) = \sum_{i=1}^{n} \omega_i \, p_i(x)    (4.1)
It should be noted that if the p_i(x) represent valid probability densities, then for p(x) to be a valid probability density, the mixing coefficients must satisfy Equation (4.2):
\sum_{i=1}^{n} \omega_i = 1    (4.2)
A mixture form is useful for modeling a PDF with a complex shape if its support (probability space) can be broken into regions where the shape of the PDF can be approximated by a simple function. In such a scenario, the mixture components act as function approximators in their respective regions. In the ensuing discussion, the adjectives simple and complex, applied to functions, refer to their mathematical or computational tractability and to the number of parameters needed to define them. A common functional form for the mixture components is the Gaussian, leading to mixture-of-Gaussians modeling [19].
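As an illustration (a Python sketch, not part of the original text), Equations (4.1) and (4.2) can be checked numerically for a two-component Gaussian mixture: the mixing coefficients sum to 1, and the resulting mixture integrates to approximately 1 over a wide interval.

```python
import math

def gaussian(x, mu, sigma):
    """Normalized one-dimensional Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture(x, weights, params):
    """Equation (4.1): p(x) = sum_i w_i p_i(x); Equation (4.2) requires the
    weights to sum to 1 for p(x) to be a valid density."""
    return sum(w * gaussian(x, mu, s) for w, (mu, s) in zip(weights, params))

weights = [0.3, 0.7]                 # illustrative mixing coefficients
params = [(-1.0, 0.5), (2.0, 1.0)]   # illustrative (mean, sigma) pairs
assert abs(sum(weights) - 1.0) < 1e-12

# Riemann-sum check that the mixture integrates to (approximately) 1.
dx = 0.01
total = sum(mixture(i * dx, weights, params) * dx for i in range(-1000, 1001))
print(total)  # close to 1.0
```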
Mixture modeling can also take the form of a classification problem, where one wants to determine the underlying distribution of a set of data points under the assumption that each data point is drawn according to one of the simpler component densities. The mixing coefficients form the prior probabilities of the components, while the class label of a data point is the component it is supposedly drawn from. The EM algorithm has been used extensively in the literature [35] to determine the parameters of mixture models in an unsupervised manner. However, putting subconstraints together is different from the classification problem, and as we shall see, the product form may be better suited to this task.
4.3.2 Product form
Approximating a probability density using a product form involves combining subconstraints or distribution functions by multiplying them together. Such a model is also known as a product of experts [34]. A product of experts combines different distributions, or "experts," by multiplying them and renormalizing the result to arrive at a joint distribution. For example, if a joint distribution over a set of five variables x_1 through x_5 can be expressed using four "experts" A, B, C, and D, such a combination would be described by Equation (4.3):
p(x_1, x_2, x_3, x_4, x_5) \propto p_A(x_1, x_2, x_3, x_4, x_5) \times p_B(x_1, x_2, x_3, x_4, x_5) \times p_C(x_1, x_2, x_3, x_4, x_5) \times p_D(x_1, x_2, x_3, x_4, x_5)    (4.3)
In Equation (4.3), p(·) is the joint-probability distribution of the five variables, and p_A(·), p_B(·), p_C(·), and p_D(·) are the factor probability distributions (or "opinions") given by the four experts. We are neglecting the normalization needed so that each term in Equation (4.3) is a valid probability distribution that integrates to 1.
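The multiply-and-renormalize operation can be sketched numerically (in Python; the two "experts" below are hypothetical toy functions, not from the thesis). Each expert is vague on its own, but their product concentrates probability where both agree.

```python
def product_of_experts(experts, grid):
    """Combine expert densities by pointwise multiplication over a discrete
    grid, then renormalize (Equation (4.3)); the normalizing sum plays the
    role of the neglected normalization constant."""
    joint = [1.0] * len(grid)
    for expert in experts:
        joint = [j * expert(x) for j, x in zip(joint, grid)]
    z = sum(joint)
    return [j / z for j in joint]

# Two hypothetical experts over a 1-D grid: each only rules out one side.
grid = [i * 0.1 for i in range(-30, 31)]
expert_a = lambda x: 1.0 if x > -1.0 else 0.01   # "x is not too small"
expert_b = lambda x: 1.0 if x < 1.0 else 0.01    # "x is not too large"

posterior = product_of_experts([expert_a, expert_b], grid)
mode = grid[posterior.index(max(posterior))]
print(mode)  # lies in the region both experts agree on, (-1, 1)
```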
While modeling a probability distribution over a high-dimensional space using a product form, the term contributed by each expert need not involve all the variables. This makes it more efficient to model distributions over high-dimensional spaces, as individual experts can specialize over small subsets of
variables. Indeed, as we shall see later, this property is used by probabilistic graphical models, such as
Bayesian networks and factor graphs, in order to simplify modeling of the joint-distribution.
The individual experts need to agree on the correct solution (the high probability region of the joint-
distribution). However, each expert is allowed to make mistakes that falsely allot high probability in
certain regions, which should actually be low probability regions of the joint space. This will work as
long as the regions where each expert makes mistakes do not coincide. In other words, all the experts
should not waste their probability distribution on the same low probability region; instead, they should
correct each other’s mistakes. Such an expression is suitable when different experts are looking at
different aspects of a complex task. Modules that perform each of these subtasks will represent different
experts.
Care must be taken, however, while designing the output of the experts. If an expert is wrongly
overconfident that a region should have a low probability, no matter how much the other experts try to
raise the probability of the region, they might never be able to offset the negative opinion of a single
overconfident expert. One way to alleviate this problem is to increase the entropy of individual experts
slightly by adding small uniform distribution terms to their outputs (and renormalizing). This ensures
that every region of the space is assigned a nonzero probability, no matter how small. As we shall see
later, it also has mathematical advantages.
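The effect of this smoothing can be shown numerically (a Python illustration with hypothetical toy experts, not from the thesis): without the small uniform term, a single zero-probability "veto" annihilates the whole product; with it, the product remains well defined everywhere.

```python
def smooth(expert, grid, eps=0.01):
    """Mix a small uniform distribution into an expert's output and
    renormalize, so that no region is assigned exactly zero probability."""
    vals = [expert(x) for x in grid]
    z = sum(vals)
    n = len(grid)
    return [(1 - eps) * v / z + eps / n for v in vals]

grid = list(range(10))
# A hypothetical overconfident expert: zero everywhere except x = 3.
overconfident = lambda x: 1.0 if x == 3 else 0.0
# Another expert that favors x = 7 just as rigidly.
other = lambda x: 1.0 if x == 7 else 0.0

raw = [overconfident(x) * other(x) for x in grid]
print(sum(raw))  # 0.0: the raw product vetoes every state outright

a = smooth(overconfident, grid)
b = smooth(other, grid)
joint = [p * q for p, q in zip(a, b)]
print(sum(joint) > 0)  # True: smoothing keeps the product well defined
```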
4.3.3 Probabilistic graphical models with product form
Probabilistic graphical models make use of the property of the product form that a component function (of the factorizable joint PDF) can take a subset of the variables as its arguments instead of the whole set. The subsets of variables occurring together as arguments of a given function define a structure of conditional dependencies among the variables. Under certain conditions, one can change from one graphical model to another while encoding the same set of conditional dependencies [59]. Here we revisit two closely related graphical models: the factor graph and the Bayesian network. The V/M graph can be viewed as a generalization of these two graphical models.
4.3.3.1 Factor graphs
A factor (function) in a product term can selectively look at a subset of dimensions, leaving the dimensions outside that subset for other factors to constrain. In other words, only a subset of the variables may be part of the constraint space of a given expert. This leads to the graph structure of a factor graph, where an edge between a factor-function node and a variable node exists only if the variable appears as one of the arguments of the factor function. This also establishes an equivalence between factor graphs and the product of experts. An example of this equivalence is shown in Equation (4.4):
p(x_1, x_2, x_3, x_4, x_5) \propto p_A(x_1, x_2, x_3, x_4, x_5) \times p_B(x_1, x_2, x_3, x_4, x_5) \times p_C(x_1, x_2, x_3, x_4, x_5) \times p_D(x_1, x_2, x_3, x_4, x_5)
                           \propto f_A(x_1, x_2) \times f_B(x_2, x_3) \times f_C(x_1, x_3) \times f_D(x_3, x_4, x_5)    (4.4)
In Equation (4.4), f_A(x_1, x_2), f_B(x_2, x_3), f_C(x_1, x_3), and f_D(x_3, x_4, x_5) are the factor functions of the factor graph. As we can see, all we have to do is add dummy variables to each of the functions to establish equivalence to the corresponding expert in the product-of-experts form. In this example, function f_A is equivalent to expert p_A, function f_B is equivalent to expert p_B, and so on. Now, we can freely borrow ideas from both types of models. The factor graph expressed mathematically in Equation (4.4) can be depicted graphically as shown in Figure 4.1.
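The factorization of Equation (4.4) can be written down directly in code. The sketch below (Python, with illustrative binary variables and made-up factor values) represents a factor graph as a list of (variable-subset, function) pairs and evaluates the unnormalized product, brute-forcing the partition function for this small example.

```python
import itertools

# A factor graph as (variable-subset, function) pairs, mirroring the scopes
# f_A(x1,x2), f_B(x2,x3), f_C(x1,x3), f_D(x3,x4,x5) of Equation (4.4).
factors = [
    (("x1", "x2"),       lambda x1, x2: 0.9 if x1 == x2 else 0.1),
    (("x2", "x3"),       lambda x2, x3: 0.8 if x2 == x3 else 0.2),
    (("x1", "x3"),       lambda x1, x3: 0.7 if x1 == x3 else 0.3),
    (("x3", "x4", "x5"), lambda x3, x4, x5: 0.6 if x3 == x4 == x5 else 0.4),
]
variables = ["x1", "x2", "x3", "x4", "x5"]

def unnormalized_joint(assignment):
    """Product of local factors; each factor sees only its own subset."""
    p = 1.0
    for scope, f in factors:
        p *= f(*(assignment[v] for v in scope))
    return p

# Brute-force the partition function over all 2^5 assignments.
z = sum(unnormalized_joint(dict(zip(variables, vals)))
        for vals in itertools.product([0, 1], repeat=5))
print(z > 0)  # True; dividing by z yields a valid joint distribution
```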
Inference in factor graphs can be performed using a local message-passing algorithm called the sum-product algorithm [33]. The algorithm reduces the exponential complexity of calculating the probability distribution over all the variables to more manageable local calculations at the variable and function
Figure 4.1 Factor graph for the example given in the text.
nodes. The local calculations depend only on the incoming messages from the nodes adjacent to the
node at hand (and the local function, in case of function nodes). The messages are actually distributions
over the variables involved. For a graph without cycles, the algorithm converges once messages have passed from one end of the graph to the other and back. For many applications, even when the graph has loops, the messages converge within a few iterations of message passing. Turbo codes in signal processing make use of this convergence property of loopy propagation [63]. Message passing is clearly a principled form of feedback, or information exchange, between modules. We will make use of a variant of message passing in our new framework, because exact message passing is not feasible for complex vision systems.
In the message-passing algorithm, there are three main types of calculation. The first is the message from a variable node to an adjacent function node: it is obtained by multiplying all the incoming messages to that variable node from its other adjacent function nodes (that is, excluding the function node we are sending the message to) and normalizing the product. The second is the message from a function node to a variable node: the local function at the function node is multiplied by all the incoming messages from the adjacent variable nodes (except the one we are sending the message to), and the product is marginalized over all variables except that of the target variable node. The third is the local belief at a variable node, which amounts to multiplying all the incoming messages at that node and normalizing the product. All these messages have to be appropriately normalized so that they represent probability distributions (that integrate to 1).
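The three calculations can be seen in action on a tiny chain-structured factor graph (a Python sketch, not from the thesis; the prior and factor values are illustrative). Since the chain is a tree, the belief computed from local messages must match the brute-force marginal.

```python
# Chain factor graph: unary prior on x1, pairwise fA(x1, x2), pairwise
# fB(x2, x3); all variables binary, all numbers illustrative.
prior_x1 = [0.8, 0.2]
fA = {(a, b): 0.9 if a == b else 0.1 for a in (0, 1) for b in (0, 1)}
fB = {(b, c): 0.7 if b == c else 0.3 for b in (0, 1) for c in (0, 1)}

def normalize(m):
    z = sum(m)
    return [v / z for v in m]

# Type 1 (variable to function): x1 forwards its only other incoming
# message, the prior.
msg_x1_to_fA = normalize(prior_x1)
# Type 2 (function to variable): multiply incoming messages with the local
# function and marginalize out every variable except the target.
msg_fA_to_x2 = normalize([sum(msg_x1_to_fA[a] * fA[(a, b)] for a in (0, 1))
                          for b in (0, 1)])
msg_fB_to_x2 = normalize([sum(fB[(b, c)] for c in (0, 1)) for b in (0, 1)])
# Type 3 (local belief): product of all incoming messages at x2.
belief_x2 = normalize([msg_fA_to_x2[b] * msg_fB_to_x2[b] for b in (0, 1)])

# Brute-force marginal of x2 for comparison; on a tree they must agree.
raw = [sum(prior_x1[a] * fA[(a, b)] * fB[(b, c)]
           for a in (0, 1) for c in (0, 1)) for b in (0, 1)]
marginal_x2 = normalize(raw)
print(belief_x2)
print(marginal_x2)  # identical to the belief, up to rounding
```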
4.3.3.2 Bayesian networks
A Bayesian network is a directed graph of nodes. Each node represents a variable, and directed edges represent dependence relations among the variables. If there is a directed edge from a node A to another node B, then we say that A is a parent of B, and B is a child of A. In other words, the variable at the head of the edge is conditionally dependent on the variable at the tail. The joint probability of all the variables is expressed as the product of their conditional probabilities given their parents.
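For a concrete (illustrative, Python) example of this factorization, consider a three-node chain A → B → C of binary variables; because every factor is a proper conditional distribution given its parents, the product automatically sums to 1, with no separate normalization constant.

```python
# Three-node Bayesian network A -> B -> C over binary variables: the joint
# factorizes as p(A) p(B|A) p(C|B). All numbers are illustrative.
pA = {0: 0.6, 1: 0.4}
pB_given_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key (b, a)
pC_given_B = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key (c, b)

def joint(a, b, c):
    """Joint probability as the product of each node's conditional given
    its parents."""
    return pA[a] * pB_given_A[(b, a)] * pC_given_B[(c, b)]

# Each factor sums to 1 over its own variable, so the joint sums to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # approximately 1.0
```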
A Bayesian network can be converted into a factor graph [59]. Thus, ideas from Bayesian networks are also applicable to factor graphs with appropriate modifications. For example, the inference algorithm known as belief propagation for Bayesian networks (also known as belief networks [60]) is an instance of the sum-product algorithm used for factor graphs [33].
Results on learning parameters [41, 42] and structure [64] in Bayesian networks can also be adapted to factor graphs. Since we concentrate only on learning parameters in this work, and not the structure, we shall see how the gradient-descent-based parameter learning algorithms [41, 42] are actually another instance of the online EM algorithm, and are applicable to V/M graphs.
4.3.4 Discussion
A cascade of modules can also be thought of as a special case of the product form, where each module in the cascade is a factor of the product and puts constraints only on its input and output variables. We have established the characteristics of the mixture form and the product form, and have also seen how the product form lends itself to the factorization that underlies probabilistic graphical models. It is now easy to further justify the advantages of the product form over the mixture form.
4.3.4.1 Advantages of product form
It can be shown that the product form has some definite advantages over the mixture form. The mixture form is suited to partitioning the data space into different simple high-probability subregions, each represented by a component of the mixture. When the components themselves have shapes very different from these subregions, a large number of components may be required. Moreover, if-then-else-type conditional gating is much easier to obtain using a product form than a mixture form. Certain useful learning algorithms, such as the EM algorithm [35], operate on log probabilities or log likelihoods, and log forms decompose into much simpler sums when a product of probabilities is involved. Effective variable partitioning, dimension partitioning, and gating are very cumbersome with mixture forms. Furthermore, by adding small uniform distributions to the factors of a product form (and renormalizing), one can obtain the function of a mixture form as well. The converse, that is, expressing a product form as a mixture form, is not easy. Thus, there are clear advantages to using the product form for modeling the joint distribution over observed and hidden variables.
To get a more intuitive feel for the expressive power of the product form, consider a set of functions p_i(x) indexed by i, where i ranges from 1 to n. As expressed in Equation (4.1), if these functions are probability distributions, a new distribution can be expressed as a mixture of these functions by taking a weighted sum, where each weight is non-negative and the weights sum to 1. Now, let us make some assumptions about the probability space and the functions that we are dealing with. Let us assume that we are dealing with a finite space, that is, the variable x is defined on a closed set X, such that Equation (4.5) holds for some finite positive constant k (0 < k < ∞):
\int_{x \in X} 1 \, dx = k    (4.5)
Let us also assume that all the functions in question are zero outside this set X, and that the functions are highly discriminative, that is, they have low entropy. Such an assumption is not unreasonable: when dealing with high-dimensional data, the underlying structure in the data occupies a much smaller subspace. This means that the data points will be sparsely distributed, and therefore functions representing the data population will be highly discriminative, with strong peaks.
Such functions will be close to zero over most of the space (since they cannot be negative and must integrate to 1).
Now, let us modify the functions a little by adding a small constant α_i to each function p_i. By small we mean that α_i ≪ max_x p_i(x). The resulting function can be renormalized to integrate to 1 over the set X. Essentially, we are creating a weighted mixture of a uniform distribution over X and the original function p_i. For a small enough weight on the uniform term, this should not affect the data-modeling capabilities of p_i. Here, the assumption is that we are interested in accurately capturing the relative probabilities of the high-probability regions, rather than worrying about the low-probability regions. As long as a low-probability region remains sufficiently unlikely to be hit by random sampling, for example, it does not matter whether our model assigns it a probability of 5ε² instead of the correct cumulative probability ε (as ε → 0).
Now, let us consider a product of such modified functions. Equation (4.6) expresses such a product as a new function p(x), neglecting for the time being the normalization needed to yield valid probability distributions:
p(x) \propto \prod_i (\alpha_i + p_i(x))
     = \prod_i \alpha_i + \Big(\prod_i \alpha_i\Big) \sum_j \frac{p_j(x)}{\alpha_j} + \Big(\prod_i \alpha_i\Big) \sum_j \sum_{k>j} \frac{p_j(x)\, p_k(x)}{\alpha_j \alpha_k} + \ldots + \prod_j p_j(x)    (4.6)
In Equation (4.6), we can safely neglect the higher-order terms, from the third term onwards (in the last line), provided one condition is satisfied. If the functions p_i have nonoverlapping high-probability regions, and if in their low-probability regions the functions are so low as to be negligible in comparison to the constants α_i (that is, p_i(x) ≪ α_i over most of the space), then the third and subsequent terms fade in comparison to the second term, leaving us with the first two terms. Renormalizing the product then leaves us with a uniform term (the first term) and a sum of the functions p_i with different weighting factors related to the α_i. Thus, for slightly modified functions (by the addition of a constant) and under certain conditions on these functions, we can obtain a mixture form by taking a product. This is further illustrated graphically in Figure 4.2. In this figure, we choose the p_i to be Gaussian functions, so the modified functions α_i + p_i(x) are uni-Gauss functions (the sum of a uniform and a Gaussian). We show how closely
the sum and the product match on the high-probability region when the peaks of the original functions
are nonoverlapping.
Figure 4.2 Product and mixture of uni-Gauss functions coincide over compact supports when the peaks are far apart: solid and dotted lines are the original one-dimensional uni-Gauss functions (the x-axis represents the free variable and the y-axis the function value), while dashed and dash-dotted lines are the normalized mixture (sum) and product, respectively. All functions are defined on the range -3 to 4 only (and can be assumed zero outside it).
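The behavior in Figure 4.2 can be reproduced numerically. The sketch below (Python, with illustrative peak locations and α values, not taken from the thesis) builds two uni-Gauss functions on the support [-3, 4] with well-separated peaks and compares the normalized mixture against the normalized product at one of the peaks.

```python
import math

# Discretize the compact support [-3, 4].
LO, HI, N = -3.0, 4.0, 7000
DX = (HI - LO) / N
GRID = [LO + (i + 0.5) * DX for i in range(N)]

def uni_gauss(mu, alpha=0.05):
    """A unit-variance Gaussian plus a small uniform term alpha,
    renormalized to integrate to 1 over the support."""
    vals = [math.exp(-0.5 * (x - mu) ** 2) + alpha for x in GRID]
    z = sum(vals) * DX
    return [v / z for v in vals]

def normalize(vals):
    z = sum(vals) * DX
    return [v / z for v in vals]

p1, p2 = uni_gauss(-1.5), uni_gauss(2.5)   # well-separated peaks
mixture = normalize([a + b for a, b in zip(p1, p2)])
product = normalize([a * b for a, b in zip(p1, p2)])

# Compare the combined densities at the grid point nearest the first peak.
i_peak = min(range(N), key=lambda i: abs(GRID[i] + 1.5))
print(mixture[i_peak], product[i_peak])  # close in the high-probability region
```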
When the high-probability regions of the original functions coincide to some extent, such claims about representing mixtures through products by adding small uniform terms cannot be guaranteed. Yet, we can find constants that, when added to the original functions, produce fairly good results. Figure 4.3 shows this for two functions whose peaks are not far apart. For the sake of the figure, the constants were found by trial and error. However, in such cases mixture modeling is also not an elegant way to model the data, as we become increasingly unsure of which mixture component a data point belongs to. On the other hand, as we shall see later, this is where the product form comes in handy, with its different "experts" modeling different dimensions or aspects of the data. Thus, in our opinion, the product form is much more versatile than the mixture form for probabilistic modeling that breaks the original function down into simpler components.
Figure 4.3 Product and mixture of uni-Gauss functions coincide over compact supports even when the peaks are fairly close: solid and dotted lines are the original one-dimensional uni-Gauss functions (the x-axis represents the free variable and the y-axis the function value), while dashed and dash-dotted lines are the normalized mixture (sum) and product, respectively. All functions are defined on the range -3 to 4 only (and can be assumed zero outside it).
4.3.4.2 Limitations of probabilistic graphical models
Despite all their advantages, probabilistic graphical models have limitations. As listed in Table 4.1, the main limitations arise from the calculations needed to estimate probability densities. For example, local message-passing algorithms for inference, such as the sum-product algorithm [33], require two kinds of calculation: products of probability distributions, and marginalization. Of the two, marginalization is in general difficult to perform, since it involves integration (or summation) over a subset of the argument space. Even while calculating the product, in order for the result to be a valid distribution, it has to be normalized so that it integrates to 1. The normalization constant is known as the partition function, and computing it again requires integration over the entire argument space.
Approximations for many operations in graphical models have been developed over the years. Statistical sampling methods, such as various forms of MCMC [38] for performing integration, or particle filtering [62] for propagating distributions over time, have been developed and applied to generative models. Variational methods [65], which approximate the overall cost function through a surrogate distribution, have also been developed.
Sampling methods are usually very slow, as they require a large number of samples to approximate complex distributions. Variational methods, on the other hand, are limited by the form and complexity of the surrogate distribution, since these determine its capacity to model the actual distribution.
CHAPTER 5
VARIABLE/MODULE GRAPHS: FACTOR GRAPHS WITH MODULES
With an understanding of the strengths and limitations of modular and generative design, we are
now in a position to develop a hybrid framework for designing modular vision systems. In this new framework, which we call variable/module graphs or V/M graphs [55], we aim to borrow the strengths of
both modular and generative design. From the generative models in general and probabilistic graphical
models in particular, we want to keep the principled way to explain all the information available and the
relations between different variables using a graphical structure. From the modular design, we want to
borrow ideas for local and fast processing of information available to a given module as well as online
adaptation of model parameters.
5.1 Replacing Functions in Factor Graphs with Modules
In Section 4.2, we showed how both design frameworks can be viewed from the same perspective: the identification of variables and the design of constraints on the joint-probability space. From that perspective, modules in modular design constrain the joint-probability space of observed and hidden variables just as the factor functions in factor graphs do. However, there are crucial differences. Without loss of generality, we will continue our discussion of graphical models in terms of factor graphs, since many other graphical models can be converted to factor graphs.
Modules in modular design take (probability distributions of) various variables as inputs and produce (probability distributions of) variables as outputs. Producing an output can be thought of as passing a message from the module to the output variable. This is comparable to part of the message-passing
algorithm in factor graphs, that is, passing a message from the function node to a variable node. This
calculation is done by multiplying messages from all the other variable nodes (except the one that we
are sending the message to) to the factor function at the function node and marginalizing the product
over all the other variables (except the one that we are sending the message to). Processing of a module
can be thought of as an approximation to this calculation.
However, the notion of a variable node does not exist in modular design. Let us, for a moment, imagine that modules are not connected to each other directly. Instead, imagine that every connection from the output of one module to the input of another is replaced by a node connected to the output of the first module and the input of the second. This node represents the output variable of the first module, which is the same as the input variable of the second module. Let us call this the variable node, much as we would in the case of factor graphs.
In other words, a cascade of modules in a modular system is nothing but a cascade of approximators to function nodes (separated by variable nodes, of course). If we generalize this notion of interconnection of modules, or module nodes, via variable nodes, we get a graph structure. Such a graph is bipartite in much the same way as a factor graph. We shall call such a graph a V/M graph. To put it another way, if we replace the function nodes in a factor graph by modules, we get a V/M graph: a bipartite graph in which the variables form one set of nodes (called variable nodes) and the modules form the other set (called module nodes).
5.2 System Modeling Using V/M Graphs and Its Relation to the Product Form
A factor graph is a graphical representation of the factorization that a product form represents. Since the V/M graph can be thought of as a generalization of the factor graph, what does this mean for the application of the product form to the V/M graph? In essence, we are still modeling the overall constraints on the joint-probability distribution using a product form. However, the rules of message passing have been relaxed, which makes the process an approximation to the exact product form. We compare this
approximation to the well-known variational methods [65] of approximation in generative models in
Sections 5.4 and 5.5.
To see how we are still modeling the joint distribution over the variables using a product form, let us start by analyzing the role of modules. A module takes the value of its input variable(s) x_i and produces a probability distribution over its output variable(s) x_j. This is nothing but the conditional distribution over the output variables given the input variables, p(x_j | x_i). Thus, each module is an instantiation of such a conditional density function.
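This module-as-conditional view can be sketched as follows (a hypothetical Python module, not from the original system): the module exposes only the ability to sample its output given an input value, which is exactly what an instantiation of p(x_j | x_i) must provide even without a closed-form density.

```python
import random

class NoisyEstimatorModule:
    """A hypothetical processing module viewed as a conditional density
    p(x_out | x_in): given an input value it returns samples of its output,
    but exposes no closed-form density, acting as a black box."""
    def __init__(self, noise=0.5):
        self.noise = noise
    def sample_output(self, x_in, n=1000):
        # Stand-in processing: the output is the input corrupted by noise.
        return [x_in + random.gauss(0.0, self.noise) for _ in range(n)]

random.seed(1)
module = NoisyEstimatorModule()
samples = module.sample_output(x_in=3.0)
mean = sum(samples) / len(samples)
print(mean)  # the sampled output distribution concentrates near the input
```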
In a Bayesian network, similar conditional probability distributions are defined, with an arrow representing the direction of causality. This makes it simple to define a module as a single edge, or a set of edges, going from the input to the output, converting the whole V/M graph into a Bayesian network, which is another graphical representation of the product form, as described in Section 4.3.3. Also, since a Bayesian network can always be converted into a factor graph [59], we can convert a V/M graph into a factor graph. However, processing modules are often arranged in a bottom-up fashion, whereas the flow of causality in a Bayesian network is top-down. This is not a problem, since we can use Bayes' rule to reverse the flow of causality. Once we have established a module as the equivalent of a conditional density, manipulation of the structure is easy, and it always remains within the purview of product-form modeling of the joint distribution. However, the similarity between V/M graphs and probabilistic graphical models ends here on the theoretical level. As we shall see in Section 5.3, the inference mechanisms applied in practice to graphical models are not applied in exactly the same manner to V/M graphs. One reason is that modules do not produce a functional form of the conditional density; they only provide a black box from which we can sample output (distributions) for given sample points of the input, and not the other way around. Thus, in practice, applying Bayes' rule to change the direction of causality is not as easy as it is in theory. We use comodules, at times, for the flow of messages in the other direction through a given module. This makes the V/M graph an approximation to an equivalent Bayesian network or, at best, a highly loopy graphical model.
5.3 Inference
In a factor graph, calculating the messages from variable nodes to function nodes, or the belief at
each variable node, is usually not difficult. When the incoming messages are in a nonparametric form,
any kind of resampling algorithm or nonparametric belief propagation [66] can be used. What is more
difficult is the integration or summation associated with the marginalization needed to calculate the
message from a function node to a variable node. Another difficulty is the complexity of designing
the local function at a function node. Since we also need to calculate the
messages using products and marginalization (or summation), we need to devise functions that model the
subconstraint as well as lend themselves to easy and efficient marginalization (or an approximation thereof).
If one were to break a function down into more subfunctions, there is a trade-off between
network complexity and function complexity for a manageable system.
This is where we can make use of the modules developed for other systems. The output of a module
can be viewed as the result of a marginalization operation used to calculate the message sent to the output variable.
The question then arises as to what we can say about the message sent to the input variable. If we really
cannot modify the module to send a message to what was the input variable in the original module,
we can view it as passing a uniform message (distribution) to the input variable. To save computation,
this message can be discounted entirely during calculations that require combining it
with other messages. However, in this framework, we encourage modifying existing modules to pass
information backwards as well. One way to do this is to associate with the module a comodule that does
the reverse of the module's processing. For example, if a module takes in a background
mask and outputs a probability map of the position of a human in the frame, the comodule will provide
a probability map of pixels belonging to the background or foreground (human) given the position of
the human.
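A toy sketch of such a module/comodule pair follows. The frame size, blob model, and constants are all illustrative assumptions; a real module would use a learned background model and a person-sized matched filter:

```python
import numpy as np

H, W = 24, 32  # hypothetical frame size

def module_mask_to_position(mask):
    """Forward module: foreground mask -> probability map over the
    person's position (here simply a normalized copy of the mask)."""
    p = mask.astype(float) + 1e-6          # avoid an all-zero map
    return p / p.sum()

def comodule_position_to_mask(pos, size=3.0):
    """Comodule: given a position, produce a per-pixel probability of
    belonging to the foreground (an isotropic blob around the person)."""
    ys, xs = np.mgrid[0:H, 0:W]
    d2 = (ys - pos[0]) ** 2 + (xs - pos[1]) ** 2
    return np.exp(-d2 / (2.0 * size ** 2))  # values in (0, 1] per pixel

mask = np.zeros((H, W))
mask[10:14, 8:11] = 1.0                       # a person-like foreground blob
pos_map = module_mask_to_position(mask)       # message toward the position
fg_prob = comodule_position_to_mask((12, 9))  # message back toward the mask
```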
Let us now see what we gain by introducing modified modules as approximations to functions and
their message calculation procedures. Essentially, we get computationally cheap approximations to complex
marginalization operations over functions that would be difficult to perform from first principles or by
statistical sampling, the approach used with generative models until now. Whether this kind of message
passing will converge, even for graphs without cycles, remains an open theoretical question; however, we
have found the results to be convincing for the applications for which we implemented it, as shown in
Chapter 6.
5.4 Learning
There are a few issues that we would like to address while designing learning algorithms for complex
vision systems. The first is that when the data and system complexity are prohibitive for batch
learning, we would like designs that lend themselves to online learning. The second
major issue is the need for a learning scheme that can be divided into steps performed
locally at different module or function nodes. This makes sense, since the parameters of a module are
usually local to the module. Especially in an online learning scheme, the parameters should depend only
on the local module and the local messages incident on the function node.
We shall derive learning methods for V/M graphs based on those for probabilistic graphical models.
Although methods for structure learning in graphical models have been explored [64, 67], we will
limit ourselves for the time being to parameter learning; structure learning is suggested as future
work in Section 7.2. In line with our stated goals in the paragraph above, we will consider online and
local parameter learning algorithms for probabilistic graphical models [41, 42] while deriving learning
algorithms for V/M graphs.
Essentially, parameter adjustment is done as gradient ascent over the log likelihood of the given
data under the model. When formulating the gradient ascent over the cost function, due to the factorization
of the joint-probability distribution, the derivative of the cost function decomposes into a sum of terms,
where each term pertains to a local function. A similar idea can be extended to our modified factor graphs,
or V/M graphs. However, the mathematics may not be straightforward because of the approximations
made to the factorization.
Now, we shall derive a gradient-ascent-based algorithm for parameter adjustment for V/M graphs.
Our goal is to find the model parameters that maximize the data likelihood p(D), which is a standard
goal in the literature [35, 41], since the (observed) data are what we have and seek to explain, while the rest
of the (hidden) variables just aid in modeling the data. Each module will be represented by a conditional
density function p_{ω_i}(x_i | N_i). Here, x_i represents the output variable of the ith module, N_i represents the
set of input variables to the ith module, and ω_i represents the parameters associated with the module. We
will make the assumption that data points are independently and identically distributed (iid), which means
that for data points d_j (where j ranges from 1 to m, the number of data points) and the data likelihood
p(D), Equation (5.1) holds:
p(D) = \prod_{j=1}^{m} p(d_j) \quad (5.1)
In principle, we can choose any monotonically increasing function of the likelihood, and we choose
the \ln(\cdot) function to convert the product into a sum. This means that for the log likelihood, Equation
(5.2) holds:
\ln p(D) = \sum_{j=1}^{m} \ln p(d_j) \quad (5.2)
Therefore, when we maximize the log likelihood with respect to the parameters ω_i, we can concentrate
on maximizing the log likelihood of each data point by gradient ascent, adding these gradients
together to get the complete gradient of the log likelihood over the entire data. Thus, at each step we
need to deal with only one data point and accumulate the result as we get more data points. This is
significant in developing online algorithms that deal with limited (one) data point(s) at a time. In the case
where we tune the parameters slowly, this is in essence like a running average with a forgetting factor.
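The effect of slow online tuning can be illustrated with a one-parameter example. The Bernoulli model, learning rate, and data stream below are our own illustrative choices, not part of the systems described in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stream of coin flips from a Bernoulli(0.7) source, seen one at a time.
data = rng.random(5000) < 0.7

theta = 0.5   # parameter tuned online, one data point at a time
lr = 0.005    # small learning rate: slow tuning acts like a running
              # average with a forgetting factor

for d in data:
    # gradient of ln p(d | theta) for a Bernoulli model
    grad = 1.0 / theta if d else -1.0 / (1.0 - theta)
    theta = float(np.clip(theta + lr * grad, 0.01, 0.99))

print(round(theta, 2))  # hovers near the true parameter 0.7
```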
Now, taking the partial derivative of the log likelihood of one data point d_j with respect to a parameter
ω_i, we get Equation (5.3). Since we will get p(d_j | x_i, N_i) as a result of message passing, and we
will get p(x_i | N_i) as the output of the processing module, all these computations can be done locally at
module i itself. The probability densities p(d_j) and p(N_i) are nonnegative functions that only scale
the gradient computation and do not change the direction of the gradient. With V/M graphs, where we do not
even expect to calculate the exact gradient, we will only try to do a generalized gradient ascent by moving
in the direction of positive gradient. It suffices that, as an approximate greedy algorithm, we move in the
general direction of increasing p(x_i | N_i) and hope that p(d_j | x_i, N_i), which is a marginalization of the
product of p(x_k | N_k) over many k's, will follow an increasing pattern as we spread the procedure over
many k's (modules). The greedy algorithm should be slow enough in gradient ascent that it can capture
the trend over many j's (data points) when run online:
\frac{\partial \ln p(d_j)}{\partial \omega_i}
= \frac{\frac{\partial}{\partial \omega_i} p(d_j)}{p(d_j)}
= \frac{\frac{\partial}{\partial \omega_i} \left( \int_{x_i, N_i} p(d_j \mid x_i, N_i)\, p(x_i, N_i)\, dx_i\, dN_i \right)}{p(d_j)}
= \frac{\frac{\partial}{\partial \omega_i} \left( \int_{x_i, N_i} p(d_j \mid x_i, N_i)\, p(x_i \mid N_i)\, p(N_i)\, dx_i\, dN_i \right)}{p(d_j)}
= \frac{\int_{x_i, N_i} \frac{\partial}{\partial \omega_i} \left( p(d_j \mid x_i, N_i)\, p(x_i \mid N_i)\, p(N_i) \right) dx_i\, dN_i}{p(d_j)}
= \frac{\int_{x_i, N_i} p(N_i)\, \frac{\partial}{\partial \omega_i} \left( p(d_j \mid x_i, N_i)\, p(x_i \mid N_i) \right) dx_i\, dN_i}{p(d_j)}
\quad (5.3)
This sketches the general insight behind the learning algorithm. The sketch is in line with a similar
derivation for Bayesian network parameter estimation in [41], where the scenario is much better defined
than the scenario defined here for V/M graphs. In Section 5.5.2, we provide another viewpoint to justify
the same steps.
5.5 Free-Energy View of EM Algorithm and V/M Graphs
For generative models, the EM algorithm [35] and its online, variational, and other approximations
have been used as the learning algorithms of choice. Online methods work by maintaining, at every step,
sufficient statistics for the q-function that approximates the probability distribution p of the hidden and
observed variables. We use a free-energy view of the EM algorithm [61] to justify a way of designing
learning algorithms for our new framework. In [61], the online or incremental version of the EM algorithm
was justified using a distributed E-step. We extend this view to justify local learning at different module
nodes. Being equivalent to a variational approximation to the factor graph means that some of the concepts
applicable to generative models, such as the variational and online EM algorithms, are applicable to
the V/M graphs. We use this insight to compare inference and learning in V/M graphs to the free-energy
view of the EM algorithm [61].
5.5.1 Online and local E-step
Let us assume that X represents the sequence of observed variables x_i, and Y represents the sequence
of hidden variables y_i. So, we are modeling the generative process p(x_i | y_i, θ), with some prior
p(y_i) on y_i, given system parameters θ (which are the same for all pairs (x_i, y_i)). Due to the Markovian
assumption of x_i being conditionally independent of x_j given Y, when i ≠ j, we get Equation (5.4):
p(X \mid Y, \theta) = \prod_i p(x_i \mid y_i, \theta) \quad (5.4)
We would like to maximize the log likelihood of the observed data X. The EM algorithm does this by
alternating between an E-step, shown in Equation (5.5), and an M-step, shown in Equation (5.6), in each
iteration with iteration number t:
\text{E-step (compute distribution): } q^{(t)}(y) = p(y \mid x, \theta^{(t-1)}) \quad (5.5)

\text{M-step (compute arg max): } \theta^{(t)} = \arg\max_{\theta} E_{q^{(t)}}\left[\log p(x, y \mid \theta)\right] \quad (5.6)
Going by the free-energy view of the EM algorithm [61], the E- and M-steps can be viewed as
alternately maximizing the free energy with respect to the q-function and the parameters θ.
This is related to the minimization of free energy in statistical physics. The formulation of the free energy
F is given in Equation (5.7):
F(q, \theta) = E_q\left[\log p(x, y \mid \theta)\right] + H(q) = -D(q \,\|\, p_\theta) + L(\theta) \quad (5.7)
In Equation (5.7), D(q‖p) represents the Kullback-Leibler divergence (KL-divergence) between q
and p, given by Equation (5.8), and L(θ) represents the data likelihood for the parameter θ. In other
words, the EM algorithm alternates between minimizing the KL-divergence between q and p, and maximizing
the likelihood of the data given the parameter θ:
D(q \,\|\, p) = \int_y q(y) \log \frac{q(y)}{p(y)}\, dy \quad (5.8)
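For discrete distributions, the integral in Equation (5.8) becomes a sum, which can be computed directly (a minimal sketch with made-up example distributions):

```python
import numpy as np

def kl_divergence(q, p):
    """Discrete KL-divergence D(q || p) = sum_y q(y) log(q(y) / p(y))."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0                     # the convention 0 * log 0 = 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, p))                     # 0.0: vanishes when q = p
print(kl_divergence([0.8, 0.1, 0.1], p) > 0)   # True: positive otherwise
```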
The equivalence of the regular form of EM and the free-energy form of EM has already been established
in [61]. Further, since the y_i's are independent of each other, the q(y) and p(y) terms can be split into
products of the individual q(y_i)'s and p(y_i)'s, respectively. This is used to justify the incremental version
of the EM algorithm, which incrementally runs partial or generalized M-steps on each data point. This can
also be done using sufficient statistics of the data collected up to that data point, if it is possible to define
sufficient statistics for a sequence of data points.
Coming back to the message passing algorithm: for each data point, when message passing converges,
the beliefs at each variable node give a distribution over all the hidden variables. The
q-function is nothing but an approximation of the actual distribution over the variables, p, and we
are trying to minimize the KL-divergence between the two. Now, we can get the same q-function from
the converged messages and beliefs in the graphical model. Hence, one can view message passing as a
localized and online version of the E-step.
5.5.2 Online and local M-step
Now, let us have a look at the M-step. The M-step involves maximizing the likelihood with respect to
the parameters θ. When performed online for a particular data point, it can be thought of as a stochastic
gradient ascent version of Equation (5.6). Making use of the sufficient statistics will improve
the approximation of the M-step, since it uses all the data presented up to that point instead of
a single data point. Now, if we take the factorization property of the joint-probability function into
account, we can also see that the M-step can be distributed locally over each component of the parameter
θ associated with each module or function node. This justifies the localized parameter updates based
on gradient ascent shown in [41, 42]. This is another critical insight that allows the online
learning algorithms devised for various modules to be used as local M-steps in our systems. Due to the
integration involved in the marginalization over the hidden variables while calculating the likelihood,
this will be an approximation of the exact M-step. Determining the conditions under which this approximation
should work will be part of our future work.
One issue that still remains is the partition function. With all the local M-steps maximizing one
term of the likelihood in a distributed fashion, it is likely that the local terms increase without bound, while
the actual likelihood does not. This problem arises when insufficient care is taken to normalize the
likelihood by dividing it by a partition function. When dealing with sampling-based numerical integration
methods such as MCMC [38], it becomes difficult to calculate the partition function. This is
because methods such as importance sampling and Gibbs sampling used in MCMC deal with a surrogate
q-function, which is usually a constant multiple of the target q-function. The multiplication factor can
be assessed by integrating over the entire space, which is difficult. There are two ways of getting around
this problem. One way, suggested in [34], is to maximize the contrastive divergence instead of the
actual divergence. The other is to put some kind of local normalization in place while calculating
the messages sent out by various modules. As long as the multiplication factor of the q-function does
not increase beyond a fixed number, we can guarantee that maximizing the local approximations of the
components of the likelihood function will actually improve system performance.
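The local-normalization safeguard is trivial to implement: every outgoing message is renormalized to a proper distribution, so the unknown scale never grows (a minimal sketch; the function name and values are illustrative):

```python
import numpy as np

def normalized_message(raw_scores):
    """Renormalize an outgoing message so it is a proper distribution.
    This keeps the multiplicative factor of the q-function bounded, so
    distributed local M-steps cannot inflate likelihood terms without
    bound."""
    raw = np.asarray(raw_scores, dtype=float)
    return raw / raw.sum()

m1 = normalized_message([3.0, 6.0, 1.0])
m2 = normalized_message([30.0, 60.0, 10.0])  # same shape, 10x the scale
```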
In the M-step of the EM algorithm, we maximize Q(θ, θ^{(i−1)}) with respect to θ, as described in
Equation (5.9):
\text{M-step: } \theta^{(i)} \leftarrow \arg\max_{\theta} Q(\theta, \theta^{(i-1)}) \quad (5.9)
In Equation (5.10), we show how this maximization can be distributed over different components of
the parameter variable θ:
Q(\theta, \theta^{(i-1)}) = E\left[\log p(X, Y \mid \theta) \mid X, \theta^{(i-1)}\right]
= \int_{h \in H} \log p(X, Y \mid \theta)\, f(Y \mid X, \theta^{(i-1)})\, dh
= \int_{h \in H} \left( \sum_{i=1}^{m} \log p(x_i, y_i \mid \theta_i) \right) f(Y \mid X, \theta^{(i-1)})\, dh
= \sum_{i=1}^{m} \int_{h \in H} \log p(x_i, y_i \mid \theta_i)\, f(Y \mid X, \theta^{(i-1)})\, dh
\quad (5.10)
In Equation (5.10), part of the generative model is represented by the function f(Y | X, θ), which
describes the distribution of the hidden variable set Y given the observation set X and the parameter
vector θ. Also, the observation set and the hidden variable set are broken into pairs (x_i, y_i),
which justifies distributing the M-step over these pairs.
5.5.3 PDF softening
Until now, PDF softening has only been justified intuitively [34]. In this section, we revisit the intuition
and justify the concept mathematically in Equation (5.11):
D(q \,\|\, p)
= \int_{x \in X} q(x) \log \frac{q(x)}{p(x)}\, dx
= \int_{x \in X} q(x) \log q(x)\, dx - \int_{y \in X} q(y) \log p(y)\, dy
= \int_{x \in X} q(x) \log \frac{\prod_i q_i(x)}{\int_{w \in X} \prod_j q_j(w)\, dw}\, dx - \int_{y \in X} q(y) \log p(y)\, dy
= \int_{x \in X} q(x) \left( \sum_i \log q_i(x) - \log \int_{w \in X} \prod_j q_j(w)\, dw \right) dx - \int_{y \in X} q(y) \log p(y)\, dy
= \int_{x \in X} \left( \sum_i q(x) \log q_i(x) - q(x) \log \int_{w \in X} \prod_j q_j(w)\, dw \right) dx - \int_{y \in X} q(y) \log p(y)\, dy
= \sum_i \left( \int_{x \in X} q(x) \log q_i(x)\, dx \right) - \int_{z \in X} q(z) \log \left( \int_{w \in X} \prod_j q_j(w)\, dw \right) dz - \int_{y \in X} q(y) \log p(y)\, dy
= \sum_i \left( \int_{x \in X} q(x) \log q_i(x)\, dx \right) - \log \left( \int_{w \in X} \prod_j q_j(w)\, dw \right) \int_{z \in X} q(z)\, dz - \int_{y \in X} q(y) \log p(y)\, dy
= \sum_i \left( \int_{x \in X} q(x) \log q_i(x)\, dx \right) - \log \left( \int_{w \in X} \prod_j q_j(w)\, dw \right) - \int_{y \in X} q(y) \log p(y)\, dy
\quad (5.11)
As shown in Equation (5.11), if we want to decrease the KL-divergence between the surrogate
distribution q and the actual distribution p, we need to minimize the sum of three terms. The first term
on the last line of the equation is minimized when the high-probability regions defined by q are
low-probability regions of the individual components q_i. This means that this term prefers diversity
among the different q_i's, since q is proportional to the product of the q_i's; the
low-probability regions of q need not be low-probability regions of a given q_i. On the other hand,
the third term is minimized if the high-probability regions defined by q overlap with the high-probability
regions defined by p, and the low-probability regions defined by q overlap with
the low-probability regions defined by p. In other words, the surrogate distribution q should closely model
the actual distribution p.
Hence, overall, the model seeks a good fit in the product, while seeking diversity in the individual terms
of the product. It also seeks to have the not-so-high-probability regions of individual q_i's overlap with the high-probability
regions of q. When p has a peaky (low-entropy) structure, these goals may seem conflicting.
However, this problem can be alleviated if the individual experts cater to different dimensions or aspects
of the probability space, while each individual distribution has high enough entropy. This justifies softening
the PDFs, which can be done by mixing in a high-entropy distribution such as the uniform distribution
(which provably has the highest entropy), by raising the distribution to a fractional power, or by raising
the variance of its peaks. Intuitively, this means that we want to strike a balance between the useful opinion
expressed by an expert and being overcommitted to any particular solution (high-probability region).
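Two of the softening operations just mentioned can be sketched directly; each raises the entropy of a peaky distribution (the example distribution and constants are illustrative choices):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p[p > 0] * np.log(p[p > 0])))

def soften_mix(p, eps=0.1):
    """Mix in a uniform distribution (the maximum-entropy choice)."""
    p = np.asarray(p, dtype=float)
    return (1.0 - eps) * p + eps / len(p)

def soften_power(p, t=0.5):
    """Raise to a fractional power and renormalize."""
    q = np.asarray(p, dtype=float) ** t
    return q / q.sum()

peaky = np.array([0.94, 0.02, 0.02, 0.02])  # low-entropy expert opinion
print(entropy(soften_mix(peaky)) > entropy(peaky))    # True
print(entropy(soften_power(peaky)) > entropy(peaky))  # True
```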
5.6 Prescription
With the discussion of the theoretical justification of the design of V/M graphs complete, in this
section we summarize how to design a V/M graph for a given application. In Chapter 6, we
will present experimental results of the successful design of vision systems for complex tasks using V/M
graphs.
To design a V/M graph for an application, we follow these guidelines:
1. Identify the variables needed to represent the solution.
2. Identify the intermediate hidden variables.
3. Suitably break down the data into a set of observed variables.
4. Identify the processing modules that can relate and constrain different variables.
5. Ensure that there is enough diversity in the processing modules.
6. Lay down the graphical structure of the V/M graph similar to how one would do that for a factor
graph, using modules instead of function nodes.
7. Redesign each module so that it can tune its parameters online to increase the local joint-probability
function.
8. Ensure that the modules have enough variance or leniency to be able to recover from mistakes,
based on the redundancy provided by the presence of other modules in the graphical structure.
9. If a module provides no feedback to a variable node, this can be treated as feedback equivalent
to a uniform distribution. Such feedback can be dropped from the calculation of local messages to save
computation.
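Guidelines 6 and 9, and the product combination of messages at a variable node, can be sketched as a minimal V/M-graph skeleton. This is a toy discrete version; the class, names, and distributions are all illustrative, not the systems built in Chapter 6:

```python
import numpy as np

class VMGraph:
    """Toy V/M-graph skeleton: each module sends a message (a
    distribution) to a variable node, and the belief at a variable is
    the normalized product of its incoming messages.  A module with no
    feedback would send a uniform message, which can simply be omitted
    (guideline 9)."""

    def __init__(self, n_states):
        self.n_states = n_states
        self.messages = {}            # (module_name, variable) -> message

    def send(self, module_name, variable, message):
        m = np.asarray(message, dtype=float)
        self.messages[(module_name, variable)] = m / m.sum()

    def belief(self, variable):
        b = np.ones(self.n_states)
        for (_, v), msg in self.messages.items():
            if v == variable:
                b *= msg
        return b / b.sum()

g = VMGraph(n_states=3)
g.send("appearance", "position", [0.2, 0.6, 0.2])   # one module's opinion
g.send("motion", "position", [0.1, 0.8, 0.1])       # a diverse second opinion
b = g.belief("position")
print(b.argmax())  # 1: both modules favor the middle state
```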
Once the system has been designed, the processing will follow a simple message passing algorithm
while each module learns in a local and online manner. If the results are not desirable, one would
want to replace some of the modules with better estimators for the given task, or make the graph more robust
by adding more (and diverse) modules, while considering making the modules more lenient.
CHAPTER 6
SOME APPLICATIONS OF V/M GRAPHS
The V/M graph framework was developed in Chapters 4 and 5 to help design fast online learning
applications for accomplishing complex video processing and understanding tasks. In this chapter, we
report design and experimental results of several applications related to the broad problem of automated
surveillance.
Vision systems for automated surveillance have evoked a lot of interest due to the increase in security
concerns all over the world and the availability of superior computing power and cheap cameras [68–70].
Of particular interest is automatic event detection in surveillance video. An event is a high-level semantic
concept and is not easy to define in terms of low-level raw data. This gap between the available
data and the useful high-level concepts is known as the semantic gap. It can safely be said that vision
systems, in general, aim to bridge the semantic gap in visual data processing.
Variables representing high-level concepts such as events can be conveniently defined over lower-level
variables such as the positions of people in a frame, provided that the defining lower-level variables
are reliably available. For example, if we were to decide whether a person came out or went in through
a door, we could do so easily if the sequence of positions of the person (and the position of the
door) across the frames of the scene were available to us. This is the rationale behind modular design,
where in this case one would devise a system for person tracking, and the output of the tracking module
would be used by an event detection module to decide whether the event has taken place.
6.1 Scenario
The scenario that we considered for our experiments is related to the broad problem of automated
surveillance. We assume a fixed camera in our experiments, but such an assumption is not a necessity
for the application of V/M graphs, since it is a generic framework to design a vision system based on
interaction between various processing modules. We also consider the scene to be staged indoors, and
again, this has no bearing on the utility of V/M graphs for vision systems that can deal with outdoor
scenes.
In the following experiments, we concentrate on several applications of V/M graphs in the surveillance
setting. We will roughly proceed from simpler tasks to increasingly complex tasks, often building
incrementally upon previously accomplished subtasks. This will also
showcase one of the advantages of V/M graphs, namely, easy extensibility.
6.2 Application: Person Tracking
We start with the most basic experiment, where we build an application for tracking a single target
(person) using a fixed indoor camera. In this application, we identify five variables that affect inference
in a frame: the intensity map (pixel values) of the frame (i.e., the observed variable(s)), the background
mask, the position of the person in the current frame, the position of the person in the previous frame, and
the velocity of the person in the previous frame. These variables are represented as x_1, x_2, x_3, x_4, and
x_5, respectively, in Figure 6.1. The variables exchange information through modules F_A, F_B, F_C, and
F_D. Module F_A represents the background subtraction module that maintains an eigen-background
model [56] as system parameters, using a modified version of the online learning algorithm for
principal component analysis (PCA) described in [71]. While it passes information from x_1 to x_2, it
does not pass it the other way round, as the image intensities are evidence and hence fixed. Module F_C serves as
the interface between the background mask and the position of the person. In effect, we run an elliptical
Gaussian filter, roughly the size of a person/target, over the background map and normalize its output
as a map of the probability of the person's position. Module F_B serves as the interface between the image
intensities and the position of the person in the current frame, x_3. Since it is computationally expensive
to perform operations at every pixel location, we sample only a small set of positions to confirm whether the
image intensities around each position resemble the appearance of the person being tracked. The module
maintains an online-learning version of the eigen-appearance of the person as system parameters, based on
a modification of previous work [72]. It also does not pass any message to x_1. The position of the
person in the current frame depends on the position of the person in the previous frame, x_4, and the
velocity of the person in the previous frame, x_5. Assuming a first-order motion model, which is encoded
in F_D as a Kalman filter, we connect x_3 to x_4 and x_5. Both x_4 and x_5 are assumed fixed for the current
frame; therefore, F_D only passes the message forward to x_3 and does not pass any message to x_4 or x_5.
Figure 6.1 V/M graph for the single-target tracking application.
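Module F_C's filtering step can be sketched as follows. The kernel sizes, the toy foreground blob, and the FFT-based convolution are our own illustrative choices, not the code used in this work:

```python
import numpy as np

def elliptical_gaussian_kernel(kh, kw, sigma_y, sigma_x):
    """Elliptical Gaussian roughly the size of a person (taller than wide)."""
    ys = np.arange(kh) - (kh - 1) / 2.0
    xs = np.arange(kw) - (kw - 1) / 2.0
    k = np.exp(-(ys[:, None] ** 2) / (2 * sigma_y ** 2)
               - (xs[None, :] ** 2) / (2 * sigma_x ** 2))
    return k / k.sum()

def position_probability(fg_map, kernel):
    """Filter the background (foreground) map with the kernel and
    normalize the response into a probability map over positions."""
    H, W = fg_map.shape
    kh, kw = kernel.shape
    s = (H + kh, W + kw)             # zero-padded linear convolution via FFT
    resp = np.fft.irfft2(np.fft.rfft2(fg_map, s) * np.fft.rfft2(kernel, s), s=s)
    resp = resp[kh // 2: kh // 2 + H, kw // 2: kw // 2 + W]
    resp = np.clip(resp, 0.0, None) + 1e-12   # clamp FFT round-off negatives
    return resp / resp.sum()

fg = np.zeros((40, 60))
fg[10:24, 20:27] = 1.0               # a person-sized foreground blob
pmap = position_probability(fg, elliptical_gaussian_kernel(15, 7, 5.0, 2.0))
peak = tuple(int(v) for v in np.unravel_index(pmap.argmax(), pmap.shape))
print(peak)  # near the center of the blob
```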
6.2.1 Message passing and learning schedule
The message passing and learning schedule used was as follows:
1. Initialize a background model.
2. If a large contiguous foreground area is detected, initialize a person-detection module F_C and
tracking-related modules F_B and F_D.
3. Initialize the position of the person in the previous frame as the most likely position according to
the background map.
4. Initialize the velocity of the person in the previous frame to be zero.
For every frame, the following will occur:
1. Propagate a message from x_1 to F_A as the image.
2. Propagate a message from x_1 to F_B as the image.
3. Propagate messages from x_4 and x_5 to F_D.
4. Propagate a message from F_D to x_3 in the form of samples of likely positions.
5. Propagate a message from F_A to x_2 in the form of a background probability map after eigen-background
subtraction.
6. Propagate a message from x_2 to F_C in the form of a background probability map.
7. Propagate a message from F_C to x_3 in the form of a probability map of likely positions of the object
after filtering x_2 with an elliptical Gaussian filter.
8. Propagate a message from x_3 to F_B in the form of samples of likely positions.
9. Propagate a message from F_B to x_3 in the form of probabilities at the samples of likely positions, as
defined by the eigen-appearance of the person maintained at F_B.
10. Combine the incoming messages from F_B, F_C, and F_D at x_3 as the product of the probabilities
at the samples generated by F_D.
11. Infer the highest-probability sample as the new object position measurement. Calculate the current
velocity.
12. Update the online eigen models at F_A and F_B.
13. Update the motion model at F_D.
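The combination of the three messages at x_3 over the sampled positions (steps 4 through 11) can be sketched as follows; the candidate positions and message values below are made up purely for illustration:

```python
import numpy as np

def combine_at_x3(samples, p_motion, p_background, p_appearance):
    """Combine the messages from F_D, F_C, and F_B at x_3 as a product
    over the candidate position samples, then infer the best sample."""
    combined = p_motion * p_background * p_appearance
    combined = combined / combined.sum()
    return samples[combined.argmax()], combined

# Five candidate positions, sampled around the prediction from F_D (step 4)
samples = np.array([[10, 10], [12, 11], [14, 12], [20, 5], [30, 30]])
p_motion     = np.array([0.30, 0.40, 0.20, 0.05, 0.05])  # from F_D (step 4)
p_background = np.array([0.10, 0.50, 0.30, 0.05, 0.05])  # from F_C (step 7)
p_appearance = np.array([0.20, 0.45, 0.25, 0.05, 0.05])  # from F_B (step 9)

best, posterior = combine_at_x3(samples, p_motion, p_background, p_appearance)
print(best)  # [12 11]: the sample all three modules favor
```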
6.2.2 Results
We ran our person tracker in both single-person and multiperson scenarios on grayscale indoor
sequences of 320 × 240 pixels from a fixed camera. People appeared as small as 7 × 30
pixels. It should be noted that no elaborate initialization and no prior training was done; the tracker
was required to run and learn on the job, fresh out of the box. The only prior information used was the
approximate size of the target, which was used to initialize the elliptical filter. Some of the successful
results on difficult sequences are shown in Figure 6.2. Running unoptimized MATLAB code on a
2.4 GHz computer, the system performs at about 2 frames per second.
(a) Brief occlusion of the object.
(b) Nearly camouflaged object.
Figure 6.2 Results of the tracking application (white rectangles around targets), zoomed in and cropped for clarity.
The tracker could easily track people successfully after complete but brief occlusions, owing to the
integration of background subtraction, eigen-appearance, and motion models. The system successfully
picks up and tracks a new person automatically when he/she enters the scene, and gracefully purges the
tracker when the person is no longer visible. As long as a person is distinct from the background for
some time during a sequence of frames, the online adaptive eigen-appearance model successfully tracks
the person even when they are subsequently camouflaged by the background. Note that any of the
tracking components in isolation would fail in difficult scenarios such as complete occlusion, widely
varying appearance of people, and background camouflage.
The tracker was not a complete success: it lost track of the object in rather difficult situations,
where the target went into occlusion while uncovering a background object that matched not only the grey level but
also the shape of the target being tracked. Such a tracking failure is shown in Figure 6.3.
Figure 6.3 Sequence showing failure in a difficult situation.
To alleviate the problem of losing track because of occlusion coupled with background
objects matching in appearance, we changed our model to include more information. Specifically, we used color
frames instead of grayscale frames. The V/M graph remains the same, as shown in Figure 6.1. The
tracking results improved tremendously, and are shown in Figure 6.4. In this figure, the trajectory of
the center of the bounding box (from previous frames) is plotted in green, and the bounding box in the
current frame is shown as a white rectangle. Note that some trajectories pass behind objects and are still
not lost.
Even the improved tracker was not perfect. When we tried yet another difficult scenario, where
the target suddenly changes velocity by a large amount (the person starts running after walking slowly),
the tracker loses track of the object and initializes a new track where the object appears in the next
frame. We think that the motion model (first-order linear) cannot take into account large
accelerations. This can be alleviated by a more sophisticated motion model. The failure results are
shown in Figure 6.5.
Figure 6.4 Different successful tracking sequences after using color information.
Figure 6.5 Even the color tracker can lose track under large acceleration of the object.
6.3 Application: Multiperson Tracking
To adapt the single-person tracker developed in Section 6.2 to multiple targets, we need to modify
the V/M graph depicted in Figure 6.1. In particular, we need at least one position variable for each
target being tracked. We also need, for each object, one variable representing the position in the previous
frame and one representing the velocity in the previous frame. On the module side, for each object we
need one module each for appearance matching, elliptical filtering of the
background map, and Kalman filtering. The resulting V/M graph is shown in Figure 6.6. The message
passing and learning schedule was essentially the same as that given in Section 6.2.1, except that the
target-specific steps were performed for each target being tracked.
Figure 6.6 V/M graph for the multiple-target tracking application (here, two targets).
6.3.1 Results
We ran our person tracker to track multiple persons in grayscale indoor sequences of 320 × 240
pixels from a fixed camera. People appeared as small as 7 × 30 pixels. It should be noted
that no elaborate initialization and no prior training was done; the tracker was required to run and learn
on the job, fresh out of the box. The results are shown in Figure 6.7. Running unoptimized MATLAB
code on a 2.4 GHz computer, the system performs at about 2 frames per second.
Figure 6.7 Results of the multitarget tracking application (white rectangles around targets), zoomed in and cropped for clarity.
We also modified the tracker to take into account the color information. The tracking results im-
proved, and the results are shown in Figure 6.8.
Figure 6.8 Different successful tracking sequences involving multiple targets and using color information.
The multiperson tracker was unable to deal with situations where two persons walked together
and one fully or partially occluded the other throughout. In such cases, often only one
person would be tracked successfully. An example of this failure is shown in Figure 6.9. This problem
can be solved by incorporating explicit occlusion modeling into the generative model underlying the V/M
graph.
Figure 6.9 Failure due to one person occluding the other.
6.4 Application: Trajectory Modeling and Prediction
A tracking system can be an essential part of a trajectory modeling system. Many interesting events
in a surveillance scenario can be recognized from trajectories: people walking into restricted areas,
violations at access-controlled doors, and movement against the general flow of traffic are a few
examples of events that can be extracted through trajectory analysis. This allows us to detect unusual
events that correspond to unusual trajectories. With this framework, it is easy to incrementally build a
trajectory modeling system on top of a tracking system, with interactive feedback from the trajectory
models improving the tracking results.
6.4.1 Trajectory modeling module
We add a trajectory modeling module F_E connected to x_3 and x_4, which represent the position of
the object being tracked in the current frame and the previous frame, respectively. The V/M graph of
the extended system is shown in Figure 6.10.
Figure 6.10 V/M graph for the trajectory modeling system.
The trajectory modeling module stores the trajectories of the people, and predicts the next position
of the object based on previously stored trajectories. The message passed from F_E to x_3 is given in
Equation (6.1):

    p_traj ∝ α + Σ_i w_i x_i^pred        (6.1)

In Equation (6.1), p_traj is the message passed from F_E to x_3, α is a constant added as a uniform
distribution, i is an index that runs over the stored trajectories, w_i is a weight based on how close
the stored trajectory is to the position and direction of the current motion, and x_i^pred is the point
that follows, on trajectory i, the stored point closest to the object's position in the previous frame.
The predicted trajectory is represented by the variable x_6.
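The message in Equation (6.1) can be sketched as follows. The Gaussian form of the closeness weights and of the bumps placed at each predicted point are illustrative stand-ins (the actual module also weights by the direction of motion, which this sketch omits); only the overall structure, a uniform floor α plus a weighted sum over predicted points x_i^pred, follows the equation.

```python
import numpy as np

def predict_from_trajectories(prev_pos, trajectories, sigma=20.0):
    """For each stored trajectory, find its point closest to the previous
    object position, take the following point as x_i^pred, and weight it by
    closeness (a Gaussian stand-in for the thesis's weighting rule)."""
    weights, preds = [], []
    for traj in trajectories:
        traj = np.asarray(traj, dtype=float)   # shape (T, 2)
        # exclude the last point so a successor always exists
        d = np.linalg.norm(traj[:-1] - prev_pos, axis=1)
        j = int(np.argmin(d))
        weights.append(np.exp(-d[j] ** 2 / (2.0 * sigma ** 2)))
        preds.append(traj[j + 1])              # next point on that trajectory
    return np.array(weights), np.array(preds)

def message_to_x3(x, weights, preds, alpha=0.05, sigma=5.0):
    """Evaluate p_traj(x) ∝ α + Σ_i w_i k(x, x_i^pred), with Gaussian bumps
    k standing in for how each prediction spreads probability mass."""
    diffs = np.linalg.norm(preds - np.asarray(x, dtype=float), axis=1)
    return alpha + float(np.sum(weights * np.exp(-diffs ** 2 / (2.0 * sigma ** 2))))
```

The constant α keeps the message nonzero everywhere, so a target that follows no stored trajectory is not suppressed outright.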
6.4.2 Results
This is a very simple trajectory-modeling module, and the values of various constants were set
empirically, although no elaborate tweaking was necessary. As shown in Figure 6.11, we can predict
the most probable trajectory in many cases where similar trajectories have been seen before.
Figure 6.11 Sequences showing successful trajectory modeling. The object trajectory is shown in green, and the predicted trajectory is shown in blue.
There are times when the target starts out close to a particular stored trajectory but then moves onto
a different path. Our system can correct its prediction (or stop making a wrong prediction) in this
changing scenario. Successful results are shown in Figure 6.12.
Figure 6.12 Sequences showing successful adaptation of trajectory prediction. Each row represents one sequence.
The experiments are very preliminary at this stage, and more data will be needed before behavior
monitoring can be attempted. Also, the trajectory modeling module is rudimentary: it stores all the
trajectories and compares the position of the object in the previous frame with every trajectory in
memory. For a system that runs for a long time and collects many trajectories, such a modeling and
matching scheme is certainly not computationally efficient. Other approaches to trajectory modeling,
such as vector quantization [15], could replace the trajectory modeling module in this framework.
6.5 Application: Event Detection Based on Single Target
The ultimate goal for automated video surveillance is to be able to do automatic event detection in
video. With trajectory analysis, we move closer to this goal, since there are many events of interest that
can be detected using trajectories. In this section, we present an application that detects whether
a person went in or came out of a secure door. To design this application, all we have to do is add an
event detection module connected to the trajectory variable node, and an event variable node connected
to the event detection module. The event detection module works according to simple rules based on the
target trajectory.
We show the V/M graph used for this application in Figure 6.13. The event detection module applies
simple rules to the trajectory to decide whether the person came out or went in. Specifically, it
checks the direction of the vector from the start point of the trajectory to its end point, and divides
the direction space into two sets to make the decision. The decision is made only when the track is
lost, not while the object is still being tracked. Thus, the event variable has three states: “no
event,” “came out,” and “went in.”
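A sketch of this decision rule follows. The convention that a positive y-component of the net motion means “went in” is an assumption adopted here for concreteness; the thesis specifies only that the direction space is split into two sets and that the decision is made once the track is lost.

```python
import math

# Illustrative sketch of the single-track door rule; the door axis
# (positive y = into the building) is an assumed convention.

def classify_door_event(trajectory, track_lost):
    """Return one of the event variable's three states. A decision is made
    only when the track is lost; until then the state is "no event"."""
    if not track_lost or len(trajectory) < 2:
        return "no event"
    (x0, y0), (x1, y1) = trajectory[0], trajectory[-1]
    # Direction of the vector from the trajectory's start point to its end point.
    angle = math.atan2(y1 - y0, x1 - x0)
    # Split the direction space into two halves around the assumed door axis.
    return "went in" if 0.0 < angle < math.pi else "came out"
```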
Figure 6.13 V/M graph for the single track-based event detection system.
6.5.1 Results
The results were quite encouraging: we obtained 100% correct event detection results owing to
reasonable tracking performance. Some results are shown in Figure 6.14. In theory, one could also
design an event detection system that gives feedback to the trajectory variable node. However, we
assumed this feedback to be a uniform distribution in this example and did not use it in any
calculations.
Figure 6.14 Results of event detection based on analysis of a single trajectory. Each row represents one sequence. The result is shown as a text label in the last frame of each sequence.
6.6 Application: Event Detection Based on Multiple Targets
We also designed applications for event detection based on multiple trajectories. Specifically, we
designed applications to detect two people meeting in a cafe scenario, and piggybacking and tailgating
at secure doors. The event detection module worked according to simple rules based on the trajectories
of the targets. We show the V/M graph used for this application in Figure 6.15.
Figure 6.15 V/M graph for the multiple track-based event detection system.
The event detection module (F_F) in Figure 6.15 applies simple rules to the trajectories of two
targets to decide whether the event has taken place. Specifically, to detect two people meeting, it
checks that the trajectories of the two people converge and stay together for a while. To detect
piggybacking or tailgating, it checks whether the trajectories of the two targets started together, in
order to infer whether the person swiping the card was aware of the presence of the other person
behind him or her. If so, the event is piggybacking; otherwise, it is tailgating.
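The two rules can be sketched as follows. The distance and frame thresholds, and the use of the two tracks' start times as a proxy for the card swiper's awareness, are illustrative assumptions; the thesis states the rules only qualitatively.

```python
import numpy as np

# Illustrative sketch of the multi-track rules; thresholds are assumptions.

def detect_meeting(traj_a, traj_b, near=15.0, min_frames=10):
    """Two trajectories "meet" if they converge and stay within `near`
    pixels of each other for at least `min_frames` consecutive frames."""
    d = np.linalg.norm(np.asarray(traj_a, float) - np.asarray(traj_b, float), axis=1)
    run = best = 0
    for close in d < near:          # longest run of consecutive close frames
        run = run + 1 if close else 0
        best = max(best, run)
    return best >= min_frames

def classify_following(start_frame_a, start_frame_b, aware_gap=5):
    """If the second track starts within `aware_gap` frames of the first,
    the card swiper was presumably aware of the follower (piggybacking);
    otherwise the follower sneaked in behind (tailgating)."""
    return "piggybacking" if abs(start_frame_b - start_frame_a) <= aware_gap else "tailgating"
```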
6.6.1 Results
We implemented three different multitarget event detection systems, one for each type of event. Two
of these detected conditions at a secure door entry point into a building, namely tailgating and
piggybacking. The system picked up 80% of the instances of tailgating and piggybacking from a total of
5 examples in the video shot. Sample results are shown in Figures 6.16 and 6.17.
Figure 6.16 Sequence showing a detected “piggybacking” event. The first two images show representative frames of the second person following the first person closely, and the third image shows the detection result as an overlaid semitransparent letter “P.”
Figure 6.17 Sequence showing a detected “tailgating” event. The first two images show representative frames of the second person following the first person at a distance (sneaking in from behind), and the third image shows the detection result as an overlaid semitransparent letter “T.”
A sample result for the event detection system for the third type of event (“meeting for lunch”) is
shown in Figure 6.18. All the results are preliminary examples of the potential of the system and are
by no means indicative of how it compares to other event detection systems. The main difficulty in
comparing different event detection systems is the lack of a commonly agreed-upon video data corpus
for benchmarking different systems in the research community.
Figure 6.18 Sequence showing a detected “meeting for lunch” event. The first two images show representative frames of the second person following the first person to the lunch table, and the third image shows the detection result as an overlaid semitransparent letter “M.”
CHAPTER 7
CONCLUSIONS AND FUTURE WORK
Complex vision tasks such as event detection in a surveillance video can be divided into subtasks
such as human detection, tracking, recognition, and trajectory analysis. The video can be thought of
as being composed of various features. These features can be roughly arranged in a hierarchy from
low-level features to high-level features. Low-level features include edges and blobs, and high-level
features include objects and events. Loosely, the low-level feature extraction is based on signal/image
processing techniques, while the high-level feature extraction is based on machine learning techniques.
7.1 Conclusions
In this work we have shown that information extracted from an image or a video can be arranged in
a hierarchy of features. Toward the higher end of the hierarchy are semantically meaningful entities,
and toward the lower end are easy-to-extract features based on the local appearance of the image or
video. We characterized the differences between the nature of the two levels of this hierarchical
representation.
Many interesting tasks and applications such as object recognition, tracking, and event detection
require high-level feature extraction. Moreover, many of the high-level tasks are interrelated and can
benefit from each other through a feedback mechanism. Feedback from the high-level features to the
low-level features is also likely to help the processing at the lower levels, and in turn the higher
levels themselves. We presented our work on systems without feedback and showed the limitations of
such systems.
We also presented our work on a system with limited feedback, and commented on ad hoc methods
of designing systems with limited feedback. We presented examples of system design paradigms that
make extensive use of feedback and interaction between various units of a complex vision system.
Among those, probabilistic graphical models directly model the generative process of the observed
data. We commented on the limitations of these models.
We presented a new framework that makes use of the graphical structure of probabilistic graphical
models. This framework greatly simplifies modeling by replacing the function nodes of a factor graph
with generic modules inspired by feedforward modular systems. Intuition for the design, inference, and
learning methods of the framework was developed from the sum-product algorithm, products of experts,
the free-energy view of the EM algorithm, and local parameter optimization in factor graphs.
Applications were developed using the new framework, and even without very sophisticated individual
modules, the results were found to be encouraging.
7.2 Future Work
On the theoretical side, future work includes mathematical analysis of the conditions for convergence
of message passing in the new framework for graphs without loops. More analysis of adding uniform
distributions to the outputs of modules is also needed. One could also analyze the effect of changing
modules. At this point, our guess is that changing a few subconstraints will not change the solution
drastically, as long as the new submodule assigns high probability to the correct solution and does
not make the same mistakes (assigning high probability to the same non-solution regions) as the other
modules.
One thing that has not been touched upon in this work is learning the structure of the graph. Since we
fit predefined modules into a graph, we only learn the parameters of the modules. In applications with
dynamic graphs, such as when tracking a varying number of targets, we initialize and destroy modules
and variable nodes on the fly in an ad hoc fashion. Principled ways to change the structure of the
graph would be an interesting and useful direction to pursue.
On the application side, one could improve the multiperson tracking application with a more
sophisticated appearance model; one choice would be an online version of a manifold-learning-based
approach [73, 74]. One could also extend our preliminary work on trajectory modeling, for instance
using a better trajectory modeling approach such as one based on vector quantization [15]. Another
extension of this work would be an application that performs trajectory analysis to detect interesting
behavior patterns and unusual events in a surveillance scenario. The work could also be applied to
simple video stabilization applications.
Beyond the scope of surveillance problems, we think our work has a lot of potential. It lays down
design ideas for complex vision systems that can be applied to a vast number of vision applications,
and to other applications where modular learning systems can be useful. Other areas that could
potentially benefit from this work are speech processing, text processing, data mining, and multimodal
information fusion, such as in multimedia systems. In-depth theoretical analysis can provide further
insight into extending and improving the system.
REFERENCES
[1] S. Lehar, The World in Your Head. Mahwah, NJ: Lawrence Erlbaum Associates, 2003.
[2] B. J. Frey and N. Jojic, “Transformation-invariant clustering using the EM algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 1–17, Jan. 2003.
[3] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the International Conference on Computer Vision, vol. 2, IEEE Computer Society, 1999, pp. 1150–1157.
[4] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 509–522, 2002.
[5] L. Fei-Fei, R. Fergus, and P. Perona, “A Bayesian approach to unsupervised one-shot learning of object categories,” in Proceedings of the International Conference on Computer Vision, IEEE Computer Society, 2003, p. 1134.
[6] A. Sethi, D. Renaudie, D. Kriegman, and J. Ponce, “Curve and surface duals and the recognition of curved 3D objects from their silhouettes,” International Journal of Computer Vision, vol. 58, no. 1, pp. 73–86, 2004.
[7] Y. L. Kergosien, “La famille des projections orthogonales d’une surface et ses singularites,” C. R. Acad. Sc. Paris, vol. 292, pp. 929–932, 1981.
[8] J. J. Koenderink and A. J. Van Doorn, “The singularities of the visual mapping,” Biological Cybernetics, vol. 24, pp. 51–59, 1976.
[9] S. Lazebnik, A. Sethi, C. Schmid, D. J. Kriegman, J. Ponce, and M. Hebert, “On pencils of tangent planes and the recognition of smooth 3D shapes from silhouettes,” in Proceedings of the European Conference on Computer Vision, 2002, pp. 651–665.
[10] Y. Furukawa, A. Sethi, J. Ponce, and D. Kriegman, “Structure and motion from images of smooth textureless objects,” in Proceedings of the European Conference on Computer Vision, vol. 2, 2004, pp. 287–298.
[11] Y. Furukawa, A. Sethi, J. Ponce, and D. Kriegman, “Robust structure and motion from outlines of smooth curved surfaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 302–315, Feb. 2006.
[12] J. J. Weng, N. Ahuja, and T. S. Huang, “Learning recognition and segmentation using the cresceptron,” International Journal of Computer Vision, vol. 25, no. 2, pp. 109–143, 1997.
[13] F. Porikli and T. Haga, “Event detection by eigenvector decomposition using object and frame features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 7, IEEE Computer Society, 2004, p. 114.
[14] G. G. Medioni, I. Cohen, F. Bremond, S. Hongeng, and R. Nevatia, “Event detection and analysis from video streams,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 8, pp. 873–889, 2001.
[15] N. Johnson and D. Hogg, “Learning the distribution of object trajectories for event recognition,” in British Machine Vision Conference, vol. 2, 1995, pp. 583–592.
[16] C. Bregler, “Learning and recognizing human dynamics in video sequences,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 1997, p. 568.
[17] N. Nguyen, H. Bui, S. Venkatesh, and G. West, “Recognizing and monitoring high-level behaviors in complex spatial environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 2003, pp. 620–625.
[18] M. Han, A. Sethi, W. Hua, and Y. Gong, “A detection-based multiple object tracking method,” in Proceedings of the International Conference on Image Processing, 2004, pp. 3065–3068.
[19] C. Stauffer and W. Grimson, “Adaptive background mixture models for real-time tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 1999, pp. 246–252.
[20] F. Jurie and C. Schmid, “Scale-invariant shape features for recognition of object categories,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 2004, pp. 90–96.
[21] J. L. Chen and A. Kundu, “Rotation and gray scale transform invariant texture identification using wavelet decomposition and hidden Markov model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 2, pp. 208–214, 1994.
[22] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[23] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, “The FERET evaluation methodology for face-recognition algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1090–1104, 2000.
[24] D. Marr, Vision. San Francisco, CA: W. H. Freeman, 1982.
[25] A. S. Ogale and Y. Aloimonos, “Shape and the stereo correspondence problem,” International Journal of Computer Vision, vol. 65, no. 1, 2005.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.
[27] R. P. N. Rao and D. H. Ballard, “Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects,” Nature Neuroscience, vol. 2, no. 1, pp. 79–87, 1999.
[28] T. S. Lee and D. Mumford, “Hierarchical Bayesian inference in the visual cortex,” Journal of the Optical Society of America A, vol. 20, pp. 1434–1448, 2003.
[29] A. Srivastava, X. Liu, and U. Grenander, “Universal analytical forms for modeling image probabilities,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 9, pp. 1200–1214, 2002.
[30] E. Bienenstock, S. Geman, and D. Potter, “Compositionality, MDL priors, and object recognition,” Advances in Neural Information Processing Systems, vol. 9, pp. 838–844, 1997.
[31] D. Mumford, “Pattern theory: A unifying perspective,” in Perception as Bayesian Inference, D. Knill and W. Richards, Eds., New York, NY: Cambridge University Press, 1996, pp. 25–62.
[32] N. Jojic and B. Frey, “Learning flexible sprites in video layers,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, IEEE Computer Society, 2001, pp. 199–206.
[33] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, Special Issue on Codes on Graphs and Iterative Algorithms, vol. 47, pp. 498–519, Feb. 2001.
[34] G. Hinton, “Products of experts,” in Proceedings of the International Conference on Artificial Neural Networks, vol. 1, 1999, pp. 1–6.
[35] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
[36] N. Jojic, N. Petrovic, B. Frey, and T. Huang, “Transformed hidden Markov models: Estimating mixture models of images and inferring spatial transformations in video sequences,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, June 2000, pp. 2026–2033.
[37] C.-E. Guo, S.-C. Zhu, and Y. N. Wu, “Modeling visual patterns by integrating descriptive and generative methods,” International Journal of Computer Vision, vol. 53, no. 1, pp. 5–29, 2003.
[38] W. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. London: Chapman & Hall, 1996.
[39] Z. Tu, X. Chen, A. Yuille, and S.-C. Zhu, “Image parsing: Unifying segmentation, detection, and recognition,” in Proceedings of the International Conference on Computer Vision, vol. 1, 2003, pp. 18–25.
[40] H. Chen and S.-C. Zhu, “A generative model of human hair for hair sketching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 2005, pp. 74–81.
[41] J. Binder, D. Koller, S. Russell, and K. Kanazawa, “Adaptive probabilistic networks with hidden variables,” Machine Learning, vol. 29, no. 2–3, pp. 213–244, 1997.
[42] E. Bauer, D. Koller, and Y. Singer, “Update rules for parameter estimation in Bayesian networks,” in Proceedings of the Thirteenth International Conference on Uncertainty in Artificial Intelligence, 1997, pp. 3–13.
[43] P. Aarabi and S. Zaky, “Robust sound localization using multi-source audiovisual information fusion,” Information Fusion, vol. 3, no. 2, pp. 209–223, 2001.
[44] A. Blake and M. Isard, Active Contours. Secaucus, NJ: Springer, 1998.
[45] M. Brandstein and D. Ward, Microphone Arrays. Berlin, Germany: Springer, 2001.
[46] R. Cutler and L. Davis, “Look who’s talking: Speaker detection using audio and video correlation,” in Proceedings of the IEEE Conference on Multimedia and Expo, IEEE Computer Society, 2000, pp. 1589–1592.
[47] A. Garg, V. Pavlovic, and J. Rehg, “Audio visual speaker detection using dynamic Bayesian networks,” in Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition, IEEE Computer Society, 2000, pp. 384–389.
[48] J. Vermaak, M. Gangnet, A. Blake, and P. Perez, “Sequential Monte Carlo fusion of sound and vision for speaker tracking,” in Proceedings of the International Conference on Computer Vision, IEEE Computer Society, June 2000, pp. 741–746.
[49] Y. Rui and Y. Chen, “Better proposal distributions: Object tracking using unscented particle filter,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 2001, pp. 786–794.
[50] M. J. Beal, N. Jojic, and H. Attias, “Graphical model for audiovisual object tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 828–836, July 2003.
[51] C. H. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, pp. 320–327, Aug. 1976.
[52] T. Kailath, “The divergence and Bhattacharyya distance measures in signal selection,” IEEE Transactions on Communication Technology, vol. 15, pp. 52–60, 1967.
[53] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, June 2000, pp. 142–149.
[54] M. Rahurkar, A. Sethi, and T. Huang, “Robust speaker tracking by fusion of complementary features from audio and video modalities,” in Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), 2005.
[55] A. Sethi, M. Rahurkar, and T. S. Huang, “Variable module graphs: A framework for inference and learning in modular vision systems,” in Proceedings of the International Conference on Image Processing, vol. 2, 2005, pp. 1326–1329.
[56] N. Oliver, B. Rosario, and A. Pentland, “A Bayesian computer vision system for modeling human interactions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 831–843, 2000.
[57] C. Stauffer and W. E. L. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747–757, 2000.
[58] C. R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780–785, 1997.
[59] B. J. Frey and N. Jojic, “A comparison of algorithms for inference and learning in probabilistic graphical models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 9, pp. 1392–1416, 2005.
[60] J. Pearl, “Fusion, propagation, and structuring in belief networks,” Artificial Intelligence, vol. 29, no. 3, pp. 241–288, 1986.
[61] R. M. Neal and G. E. Hinton, “A new view of the EM algorithm that justifies incremental, sparse and other variants,” in Learning in Graphical Models, M. I. Jordan, Ed., Norwell, MA: Kluwer Academic Publishers, 1998, pp. 355–368.
[62] M. Isard and A. Blake, “Condensation: Conditional density propagation for visual tracking,” International Journal of Computer Vision, vol. 29, no. 1, pp. 5–28, 1998.
[63] R. J. McEliece, D. J. C. MacKay, and J. F. Cheng, “Turbo-decoding as an instance of Pearl’s ‘belief propagation’ algorithm,” IEEE Journal on Selected Areas in Communications, vol. 16, pp. 140–152, Feb. 1998.
[64] D. Heckerman, “A tutorial on learning with Bayesian networks,” in Learning in Graphical Models, M. I. Jordan, Ed., pp. 301–354, 1999.
[65] T. S. Jaakkola, “Variational methods for inference and estimation in graphical models,” Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, 1997.
[66] E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, IEEE Computer Society, 2003, pp. 605–612.
[67] D. Margaritis, “Learning Bayesian network model structure from data,” Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, 2003.
[68] R. T. Collins, A. J. Lipton, and T. Kanade, “Special section on video surveillance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 745–746, 2000.
[69] J. M. Ferryman, S. J. Maybank, and A. D. Worrall, “Visual surveillance for moving vehicles,” International Journal of Computer Vision, vol. 37, no. 2, pp. 187–197, 2000.
[70] R. Cucchiara, “Multimedia surveillance systems,” in Proceedings of the Third ACM International Workshop on Video Surveillance and Sensor Networks, 2005, pp. 3–10.
[71] Y. Li, L. Xu, J. Morphett, and R. Jacobs, “An integrated algorithm of incremental and robust PCA,” in Proceedings of the International Conference on Image Processing, vol. 1, 2003, pp. 245–248.
[72] J. Lim, D. Ross, R. Lin, and M. Yang, “Incremental learning for visual tracking,” Advances in Neural Information Processing Systems, pp. 793–800, 2005.
[73] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[74] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
AUTHOR’S BIOGRAPHY
Amit Sethi was born in Jalandhar (Punjab), India, in 1978. He received his bachelor's degree in
electrical engineering in 1999 from the Indian Institute of Technology, New Delhi, India, and his
master's degree in general engineering in 2001 from the University of Illinois at Urbana-Champaign, USA.
Since 2001 he has been a research assistant in the Department of Electrical and Computer Engineering
and at the Beckman Institute for Advanced Science and Technology. He was a visiting researcher
and a summer research intern at NEC Labs, Cupertino, CA, during the spring and summer of 2003,
respectively. His research interests include applications of graphical models and machine learning to
multimedia processing, image understanding and video understanding.